Can I build an ANTLR lexer that recognizes string tokens containing Unicode, but that restricts all other tokens to be simple 7-bit ASCII encoding?

Terence Parr

Yes, but you will have to be very specific; i.e., you can't use wildcards or the ~ (not) operator. If you specify a STRING rule that has '\u0080'..'\ufffe' or some such in it, that increases the vocabulary for the whole input stream, so a wildcard would then match all of those Unicode characters as well. You'd therefore have to spell out ranges such as 'a'..'z' explicitly in the ID rule.

A simpler way is to allow Unicode everywhere and then catch any use of Unicode outside of a string afterwards (or as part of the input char stream).
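
As a rough illustration of the "catch it afterwards" idea, here is a minimal Java sketch that scans the raw input and reports the first character above 0x7F found outside a string literal. It does not use any ANTLR API. The class and method names are hypothetical, and it assumes strings are delimited by double quotes with backslash as the escape character; adjust it to match your own STRING rule.

```java
public class AsciiOutsideStrings {
    /**
     * Returns the index of the first char above 0x7F that lies outside a
     * double-quoted string literal, or -1 if there is none.
     * Assumption: strings use '"' as delimiter and '\' as escape character.
     */
    static int firstNonAsciiOutsideString(String input) {
        boolean inString = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;              // skip the escaped character
                } else if (c == '"') {
                    inString = false; // closing quote
                }
            } else if (c == '"') {
                inString = true;      // opening quote
            } else if (c > 0x7F) {
                return i;             // non-ASCII outside a string
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Non-ASCII only inside the string: accepted.
        System.out.println(firstNonAsciiOutsideString("name = \"caf\u00e9\";")); // prints -1
        // Non-ASCII in an identifier: rejected at index 3.
        System.out.println(firstNonAsciiOutsideString("caf\u00e9 = \"x\";"));    // prints 3
    }
}
```

The same check could instead be done per token after lexing, by rejecting any non-STRING token whose text contains a character above 0x7F.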