Parser driven lexing
1 posts in topic
Flat View  Flat View
TOPIC ACTIONS:
 

Posted By:   David_Sundstrom
Posted On:   Monday, June 9, 2003 12:52 PM

I inherited a projected based on PCCTS a few years back, and have since implemented several parsers based on PCCTS and now ANTLR. Through it all I have always wondered why parsing and lexing are distinct and separate steps. During lexical analysis, there are ambiguities that could be resolved if there were parsing context. In other words, at any point in a parse, the set of tokens I expect next are a subset of all tokens, and two tokens which might otherwise be mutually ambiguous never share the same subset. There is a similar FAQ entitled "Why can't the parser tell the lexer exactly what to go get?", however, it didn't exactly answer this question. The parser would not inform th   More>>

I inherited a projected based on PCCTS a few years back, and have since implemented several parsers based on PCCTS and now ANTLR.

Through it all I have always wondered why parsing and lexing are distinct and separate steps. During lexical
analysis, there are ambiguities that could be
resolved if there were parsing context. In other words,
at any point in a parse, the set of tokens I expect next
are a subset of all tokens, and two tokens which might
otherwise be mutually ambiguous never share the same subset.

There is a similar FAQ entitled "Why can't the parser tell the lexer exactly what to go get?", however, it didn't
exactly answer this question. The parser would not inform the lexer of the exact next token, but instead the set of all possible tokens that would satisfy any of the current parser alternatives.

Is such an approach practical, or do implementation
reasons dictate completely separate lexer and parse
passes?

-David

   <<Less

Re: Parser driven lexing

Posted By:   Monty_Zukowski  
Posted On:   Tuesday, June 10, 2003 12:55 PM

One of the problems is that separating lexing from parsing lets you dramatically reduce your lookahead. The gcc parser has a lookahead of k=2. The lexer is k=3. But if you realize that the parser has to see the second token to decide what to do, and that the length in characters of the first token could be 256 characters for a C variable name, you would see that you need at least k=259 if there was no lexer! It's even worse if you don't set a cap on the maximum length of strings or comments. It just is not practical.


On the other hand, there are definately uses for a lexer that uses the information from the previous token to guide it to the proper subset of the next tokens. I do something like that with my "parser filter" example at http://www.codetransform.com/filterexample.html. I have to think a bit about whether antlr could be structured to use hints from the parser in the lexer. Right now the lexer knows absolutely nothing about what just was lexed unless you keep track of that yourself or use lexer states.

The problem with lexer states is that you would have to chain them. Generally you are trying to mimic a parser rule fragment in the lexer. You would need a new lexer state for each token in the flow of the rule and that gets tedious quickly.


An alternate solution is to allow the lexer to buffer tokens so that you can actually match many tokens in one fell swoop. This can be done by hand right now but it too is tedious, requiring an action after every matched token. -Monty

About | Sitemap | Contact