What is a "protected" lexer rule?

Terence Parr

A lexer is a TokenStream source that merely spits out a stream of Token objects to the parser (or another stream consumer). As such, a lexer implements method nextToken() to satisfy interface TokenStream. The parser repeatedly calls yourlexer.nextToken() to get tokens.
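As a rough illustration of that pull model, here is a minimal Java sketch of a consumer draining a token stream. SketchToken, SketchTokenStream, and the token type constants are hypothetical stand-ins invented for this example, not ANTLR's own classes, and the "lexer" is just a canned list of token texts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins for illustration; these are not ANTLR's own classes.
class SketchToken {
    static final int EOF = 1;   // assumed type code for end of input
    static final int INT = 4;   // assumed type code for an INT token
    final int type;
    final String text;
    SketchToken(int type, String text) { this.type = type; this.text = text; }
}

// Anything that can hand out tokens one at a time.
interface SketchTokenStream {
    SketchToken nextToken();
}

class ConsumerLoopDemo {
    // Pulls tokens until EOF, the way a parser drains its lexer.
    static List<String> drain(SketchTokenStream lexer) {
        List<String> seen = new ArrayList<>();
        for (SketchToken t = lexer.nextToken();
             t.type != SketchToken.EOF;
             t = lexer.nextToken()) {
            seen.add(t.text);
        }
        return seen;
    }

    // A canned stream standing in for a real lexer: one INT per text, then EOF.
    static SketchTokenStream canned(String... texts) {
        Iterator<String> it = Arrays.asList(texts).iterator();
        return () -> it.hasNext()
                ? new SketchToken(SketchToken.INT, it.next())
                : new SketchToken(SketchToken.EOF, "");
    }

    public static void main(String[] args) {
        System.out.println(drain(canned("12", "345"))); // prints [12, 345]
    }
}
```

The consumer never knows how the stream produces its tokens; it only sees what nextToken() returns.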

What token definitions result in token objects that get sent to the parser? The answer you'd expect or the one you're used to is, "You get a Token object for every lexical rule in your lexer grammar." This is indeed the default case for ANTLR's lexer grammars.

What if you want to break up the definition of a complicated rule into multiple rules? Surely you don't want every rule to result in a complete Token object in this case. Some rules are only around to help other rules construct tokens. To distinguish these "helper" rules from rules that result in tokens, use the protected modifier. This overloading of the access-visibility Java term occurs because if the rule is not visible, it cannot be "seen" by the parser.

Another, more practical, way to look at this is to note that only non-protected rules get called by nextToken() and, hence, only non-protected rules can generate tokens that get shoved down the TokenStream pipe to the parser.

I now recognize this approach as a mistake. I have a number of other proposals to fix it, none of which seems to satisfy everyone.

class L extends Lexer;

/** This rule is "visible" to the parser
 *  and a Token object is sent to the
 *  parser when an INT is matched.
 */
INT : (DIGIT)+ ;

/** This rule does not result in a token
 *  object that is passed to the parser.
 *  It merely recognizes a portion of INT.
 */
protected
DIGIT : '0'..'9' ;

By definition, all lexical rules return Token objects (ANTLR optimizes away many of these object creations, however), but only the Token objects of non-protected rules get pulled out of the lexer itself.
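To make the dispatch concrete, here is a hand-written Java sketch of what a lexer for INT and DIGIT might look like. It is a simplification under stated assumptions, not the code ANTLR actually generates: the class name HandLexerL, the method names mINT and mDIGIT, the use of plain strings in place of Token objects, and the blank-skipping are all choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written sketch of grammar L; names and structure are illustrative
// assumptions, not the code ANTLR actually generates.
class HandLexerL {
    private final String input;
    private int pos = 0;
    private String lastText;   // text matched by the last non-protected rule

    HandLexerL(String input) { this.input = input; }

    // Only the non-protected rule INT is reachable from here.
    // Returns the next INT's text, or null at end of input.
    public String nextToken() {
        while (pos < input.length() && input.charAt(pos) == ' ') pos++; // skip blanks
        if (pos >= input.length()) return null;
        mINT();
        return lastText;
    }

    // Non-protected rule: INT : (DIGIT)+ ;
    public void mINT() {
        int start = pos;
        mDIGIT();                                   // at least one digit
        while (pos < input.length() && isDigit(input.charAt(pos))) mDIGIT();
        lastText = input.substring(start, pos);
    }

    // Protected rule: DIGIT : '0'..'9' ;
    // It consumes one digit for its caller; nextToken() never calls it directly.
    protected void mDIGIT() {
        if (pos >= input.length() || !isDigit(input.charAt(pos))) {
            throw new IllegalStateException("expected digit at position " + pos);
        }
        pos++;
    }

    private static boolean isDigit(char c) { return c >= '0' && c <= '9'; }

    // Convenience helper for the demo: collect every token in the input.
    static List<String> tokens(String input) {
        HandLexerL lexer = new HandLexerL(input);
        List<String> out = new ArrayList<>();
        for (String t = lexer.nextToken(); t != null; t = lexer.nextToken()) out.add(t);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("12 345 6")); // prints [12, 345, 6]
    }
}
```

The input "12 345 6" yields three tokens, one per INT; no token is ever emitted for an individual DIGIT, because mDIGIT is reachable only through mINT.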