How can I use multiple lexers to scan my input (how do I get "lexer states")?

Terence Parr

Complicated input languages often have a few lexical constructs that are really nested lexical structures, such as Java's javadoc comments, regular comments, or even just strings. In other words, the allowable characters and structure within the comment or string are radically different from those of the input surrounding that lexeme.

In automaton-based lexers, you use multiple states to handle these situations. Upon "open comment", you switch to a different state and then have that state switch you back to the normal state upon "close comment". This is like launching another lexer to handle the nested lexical structure.
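As a rough illustration of the state-switching idea (not ANTLR code, and not how any particular lexer generator implements it), here is a minimal hand-written Java sketch. The class name StateLexerSketch, the two states, and the "WORD:"/"COMMENT:" token spellings are all made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

class StateLexerSketch {
    // Hypothetical lexer states: ordinary input vs. inside a /* ... */ comment.
    enum State { NORMAL, COMMENT }

    /** Splits input into "WORD:..." and "COMMENT:..." tokens using two states. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        State state = State.NORMAL;
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (state == State.NORMAL) {
                if (input.startsWith("/*", i)) {
                    flushWord(tokens, text);
                    state = State.COMMENT;   // "open comment": switch state
                    i += 2;
                } else if (Character.isWhitespace(c)) {
                    flushWord(tokens, text);
                    i++;
                } else {
                    text.append(c);
                    i++;
                }
            } else {
                if (input.startsWith("*/", i)) {
                    tokens.add("COMMENT:" + text);
                    text.setLength(0);
                    state = State.NORMAL;    // "close comment": switch back
                    i += 2;
                } else {
                    text.append(c);          // any character is legal in here
                    i++;
                }
            }
        }
        // An unterminated comment is silently dropped in this sketch.
        if (state == State.NORMAL) {
            flushWord(tokens, text);
        }
        return tokens;
    }

    static void flushWord(List<String> tokens, StringBuilder text) {
        if (text.length() > 0) {
            tokens.add("WORD:" + text);
            text.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int x /* a b */ y"));
    }
}
```

The two branches of the loop accept completely different character sets, which is the point of having separate states.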

ANTLR's approach is to do just that: use a different lexer for each nested lexical substructure. One nested lexer can even invoke another, so lexers stack much like method invocations. All lexers operating on the same input stream share a single input-state object, so they see the same lookahead, line number, and so on. Please see Token Stream Multiplexing for a complete description and an example of how to use multiple lexers.
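To make the stacking and the shared input state concrete, here is a small self-contained Java sketch. It deliberately does not use ANTLR's classes; SharedInput, Lexer, Selector, and the token spellings are hypothetical names invented for this illustration, and the real mechanism is the one described in Token Stream Multiplexing.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class MultiplexSketch {
    /** Position and line number shared by every lexer on this input. */
    static class SharedInput {
        final String text;
        int pos = 0;
        int line = 1;
        SharedInput(String text) { this.text = text; }
        boolean atEnd() { return pos >= text.length(); }
        boolean lookingAt(String s) { return text.startsWith(s, pos); }
        char consume() {
            char c = text.charAt(pos++);
            if (c == '\n') line++;
            return c;
        }
        /** Consumes s if it is next in the input; returns whether it did. */
        boolean match(String s) {
            if (!lookingAt(s)) return false;
            for (int i = 0; i < s.length(); i++) consume();
            return true;
        }
        void skipWhitespace() {
            while (!atEnd() && Character.isWhitespace(text.charAt(pos))) consume();
        }
        /** Consumes at least one character, stopping at whitespace or a comment delimiter. */
        String word() {
            int start = pos;
            do {
                consume();
            } while (!atEnd() && !Character.isWhitespace(text.charAt(pos))
                    && !lookingAt("/**") && !lookingAt("*/"));
            return text.substring(start, pos);
        }
    }

    interface Lexer { String nextToken(SharedInput in, Selector sel); }

    /** Delegates to whichever lexer is on top of the stack. */
    static class Selector {
        final SharedInput in;
        final Deque<Lexer> stack = new ArrayDeque<>();
        Selector(SharedInput in, Lexer start) { this.in = in; stack.push(start); }
        void push(Lexer lexer) { stack.push(lexer); }
        void pop() { stack.pop(); }
        String nextToken() {
            in.skipWhitespace();
            return in.atEnd() ? null : stack.peek().nextToken(in, this);
        }
    }

    // Lexer for the inside of a doc comment: tags and free text.
    static final Lexer DOC = (in, sel) -> {
        if (in.match("*/")) { sel.pop(); return "DOC_END"; }
        String w = in.word();
        return (w.startsWith("@") ? "TAG:" : "TEXT:") + w;
    };

    // Lexer for ordinary input; pushes DOC when a doc comment opens.
    static final Lexer MAIN = (in, sel) -> {
        if (in.match("/**")) { sel.push(DOC); return "DOC_START"; }
        return "ID:" + in.word();
    };

    static List<String> tokenize(String text) {
        Selector sel = new Selector(new SharedInput(text), MAIN);
        List<String> out = new ArrayList<>();
        for (String t = sel.nextToken(); t != null; t = sel.nextToken()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("int /** @param x */ y"));
    }
}
```

Because both lexers read through the same SharedInput, the nested lexer picks up exactly where the outer one stopped, and the outer one resumes right after the closing delimiter. The consumer of the token stream sees only one stream.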

Note that sometimes you should just call another lexical rule instead of invoking another lexer. The golden rule for deciding when to use multiple lexers combined (multiplexed) into a single token stream is:

Complicated single tokens should be matched by calling another (protected) lexer rule, whereas streams of tokens from diverse slices or sections of the input should be handled by different lexers multiplexed onto the same stream that feeds the parser.