Tuesday, June 10, 2003 12:55 PM
One of the problems with doing away with the lexer is that separating lexing from parsing lets you dramatically reduce your lookahead. The gcc parser has a lookahead of k=2; the lexer is k=3. But consider that the parser has to see the second token to decide what to do, and that the first token alone could be 256 characters long (a C variable name): without a lexer you would need at least k=259! It's even worse if you don't cap the maximum length of strings or comments. It just is not practical.
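To make that arithmetic concrete, here is a toy sketch (not gcc's actual code) of why tokenizing first shrinks the lookahead: once a lexer has grouped characters into tokens, the parser's second decision point is always at index 1, no matter how long the first identifier is.

```python
import re

# A toy tokenizer: identifiers, numbers, or any single character.
TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+|.)")

def tokenize(text):
    """Greedily split text into identifiers, numbers, and single chars."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            break
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

# With a 256-character variable name, a parser working on raw characters
# would need ~259 characters of lookahead just to see the second token;
# at the token level it is simply tokens[1], i.e. k=2.
long_name = "x" * 256
toks = tokenize(long_name + " = 1;")
# toks[1] is '=' regardless of how long the first identifier was.
```

The point of the sketch is only that the lexer converts an unbounded character-level lookahead into a small, fixed token-level one.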
On the other hand, there are definitely uses for a lexer that uses information from the previous token to guide it to the proper subset of the next tokens. I do something like that with my "parser filter" example at http://www.codetransform.com/filterexample.html. I have to think a bit about whether antlr could be structured to use hints from the parser in the lexer. Right now the lexer knows absolutely nothing about what was just lexed unless you keep track of that yourself or use lexer states.
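Here's a hypothetical sketch (not the parser-filter code from the link) of the "keep track of it yourself" approach: the lexer remembers the previous token's text and uses it to pick the right rule for '<', lexing it as the start of an include filename only right after '#include'.

```python
def lex(text):
    """Toy lexer that feeds the previous token back into rule selection."""
    tokens, prev, i = [], None, 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
            continue
        if ch == '<' and prev == '#include':
            # Context-dependent rule: '<...>' is one FILENAME token here.
            # (Raises ValueError if the '>' is missing -- fine for a sketch.)
            j = text.index('>', i)
            tok = ('FILENAME', text[i:j + 1])
            i = j + 1
        elif ch.isalpha() or ch == '#':
            j = i + 1
            while j < len(text) and (text[j].isalnum() or text[j] in '._'):
                j += 1
            tok = ('WORD', text[i:j])
            i = j
        else:
            # Everywhere else '<' is just an ordinary character token.
            tok = ('CHAR', ch)
            i += 1
        tokens.append(tok)
        prev = tok[1]
    return tokens
```

So `#include <stdio.h>` yields a FILENAME token, while `a < b` lexes '<' as a plain operator character. The feedback here is one token of lexer-side memory; a parser-driven version would set that context from the grammar instead.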
The problem with lexer states is that you would have to chain them. Generally you are trying to mimic a parser rule fragment in the lexer, so you need a new lexer state for each token position in the rule, and that gets tedious quickly.
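A minimal sketch of that chaining, using a made-up rule fragment `KEY ':' VALUE`: even this three-token fragment needs three lexer states, each one naming its successor, and every extra token position in the rule means another state.

```python
import re

# One lexer state per token position in the rule fragment KEY ':' VALUE.
# Each entry: (pattern for this position, token type, next state).
STATES = {
    'EXPECT_KEY':   (r'[A-Za-z_]\w*', 'KEY',   'EXPECT_COLON'),
    'EXPECT_COLON': (r':',            'COLON', 'EXPECT_VALUE'),
    'EXPECT_VALUE': (r'\d+',          'VALUE', 'EXPECT_KEY'),
}

def lex_with_states(text):
    state, pos, tokens = 'EXPECT_KEY', 0, []
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        pattern, ttype, next_state = STATES[state]
        m = re.match(pattern, text[pos:])
        if not m:
            raise SyntaxError(f"expected {ttype} at position {pos}")
        tokens.append((ttype, m.group()))
        pos += len(m.group())
        state = next_state        # chain into the next state
    return tokens
```

The tedium the paragraph describes is visible in the STATES table: the grammar rule is smeared across the lexer as a hand-maintained chain of states.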
An alternate solution is to allow the lexer to buffer tokens so that you can actually match many tokens in one fell swoop. This can be done by hand right now but it too is tedious, requiring an action after every matched token. -Monty
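The buffering idea could look something like this hypothetical sketch: one lexer rule matches an identifier and a following '(' in one swoop, queues the extra token, and next_token() drains the queue before scanning again.

```python
from collections import deque

class BufferingLexer:
    """Toy lexer whose rules may match several tokens at once and buffer
    the extras, so callers still receive one token per next_token()."""

    def __init__(self, text):
        self.text, self.pos = text, 0
        self.queue = deque()

    def next_token(self):
        if self.queue:                       # drain buffered tokens first
            return self.queue.popleft()
        while self.pos < len(self.text) and self.text[self.pos].isspace():
            self.pos += 1
        if self.pos >= len(self.text):
            return ('EOF', '')
        ch = self.text[self.pos]
        if ch.isalpha():
            j = self.pos
            while j < len(self.text) and self.text[j].isalnum():
                j += 1
            name = self.text[self.pos:j]
            self.pos = j
            # One fell swoop: ID immediately followed by '(' is matched
            # as a unit, but still delivered as two separate tokens.
            if self.pos < len(self.text) and self.text[self.pos] == '(':
                self.pos += 1
                self.queue.append(('LPAREN', '('))
                return ('FUNC_NAME', name)
            return ('ID', name)
        self.pos += 1
        return ('CHAR', ch)
```

Lexing `foo(bar` yields FUNC_NAME, then the buffered LPAREN, then a plain ID. Doing this by hand in a grammar today would need an action after each matched token to push onto that queue, which is the tedium mentioned above.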