'"' could mean String literal or delimited expression. How to lex?
Posted By:   Adam_McClure
Posted On:   Friday, August 30, 2002 12:14 PM

I keep staring at the stream multiplexing info thinking the solution is somewhere in there, but I'm stumped.


Here's the problem: The grammar for XQuery specifies two meanings for strings.


First there is the typical " or '-delimited strings composed of characters and whitespace. The correct behavior is to recognize the first quote character, slurp in all characters until you see another quote, take the trailing quote, and return the token.


The second meaning is within an XML attribute. Suppose you have code like:






Well, the lexer will consume the entire attribute value as a single string token, when what we really want is to parse within the string: not only validating the enclosed expression, but also generating an AST that reflects its content.


Thoughts?


Re: '"' could mean String literal or delimited expression. How to lex?

Posted By:   Adam_McClure  
Posted On:   Friday, August 30, 2002 01:26 PM

I'm going to answer my own question here in hopes others will find it illustrative.


The main issue was that there was no disambiguating character. The '"' and '\'' characters each have two different syntactic meanings, so there was no single token that could be recognized to switch lexers as part of a multiplexing scheme.


So I had to take a different route....


The solution was to refactor the code so that each quote character gets its own lexer rule, with an optional subrule that recognizes an escaped (doubled) quote, "" or '', and changes the token type accordingly.


DQUOTE : '"' ('"' {$setType(ESC_DQUOTE);})? ;

QUOTE : '\'' ('\'' {$setType(ESC_QUOTE);})? ;
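
For anyone who wants to see the token stream these two rules produce without firing up ANTLR, here is a minimal Python sketch of the same idea. It is not generated lexer code; the tokenize function and the catch-all CHAR token type are assumptions made for illustration only.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)


def tokenize(text: str) -> List[Token]:
    """Split text into quote tokens and single-character CHAR tokens.

    Mirrors the DQUOTE and QUOTE rules: a quote immediately followed by
    the same quote character becomes ESC_DQUOTE or ESC_QUOTE.
    CHAR is a hypothetical catch-all type used only in this sketch.
    """
    tokens: List[Token] = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in ('"', "'"):
            name = "DQUOTE" if ch == '"' else "QUOTE"
            if i + 1 < len(text) and text[i + 1] == ch:
                # Doubled quote: the optional subrule matched, so retype the token.
                tokens.append(("ESC_" + name, ch * 2))
                i += 2
            else:
                tokens.append((name, ch))
                i += 1
        else:
            tokens.append(("CHAR", ch))
            i += 1
    return tokens


if __name__ == "__main__":
    for tok in tokenize('"a""b"'):
        print(tok)
```

One thing the sketch makes visible: because the doubled-quote check always wins, an empty literal written as "" comes out as a single ESC_DQUOTE token rather than two DQUOTE tokens, so a grammar built this way probably needs to account for that case.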


Then I went up to the parser layer and defined a rule called 'string' like this:


string :

DQUOTE (ESC_DQUOTE | {LA(1)!=DQUOTE}? .)* DQUOTE

| QUOTE (ESC_QUOTE | {LA(1)!=QUOTE}? .)* QUOTE

;
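
The 'string' rule can be sketched outside ANTLR as well. The Python below assumes tokens arrive as (type, text) pairs with the type names used above; parse_string and the CHAR token type are hypothetical names for illustration, not part of any ANTLR API.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (token type, matched text)

# Maps each delimiter token type to its escaped (doubled) counterpart.
ESCAPES = {"DQUOTE": "ESC_DQUOTE", "QUOTE": "ESC_QUOTE"}


def parse_string(tokens: List[Token], pos: int = 0) -> Tuple[str, int]:
    """Match one string literal starting at pos.

    Returns the literal's value and the index just past the closing quote.
    """
    if pos >= len(tokens) or tokens[pos][0] not in ESCAPES:
        raise SyntaxError("expected DQUOTE or QUOTE")
    delim = tokens[pos][0]
    esc = ESCAPES[delim]
    pos += 1
    parts: List[str] = []
    while pos < len(tokens):
        kind, text = tokens[pos]
        if kind == delim:
            # Same role as the {LA(1)!=DQUOTE}? guard: stop at the matching quote.
            return "".join(parts), pos + 1
        # An escaped delimiter contributes one quote character;
        # every other token is taken verbatim.
        parts.append(text[0] if kind == esc else text)
        pos += 1
    raise SyntaxError("unterminated string")


if __name__ == "__main__":
    sample = [
        ("DQUOTE", '"'),
        ("CHAR", "a"),
        ("ESC_DQUOTE", '""'),
        ("CHAR", "b"),
        ("DQUOTE", '"'),
    ]
    print(parse_string(sample))
```

Collapsing the doubled quote to a single character is a choice made in this sketch; the grammar rule itself only matches the tokens and leaves the value to whatever action code is attached.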


Works like a charm. When I want to parse the string completely, I simply recognize DQUOTE or QUOTE and let the parser continue with subrules and follow up with a matching quote token (e.g. DQUOTE mySubrule DQUOTE). When I want a string literal I use the 'string' parser rule.
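
The DQUOTE mySubrule DQUOTE pattern might look roughly like the following in plain Python. This is only a sketch under assumptions: parse_delimited is a hypothetical helper, and chars_subrule is a trivial stand-in for whatever mySubrule really does (in practice it would build the AST for the enclosed expression).

```python
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (token type, matched text)
Subrule = Callable[[List[Token], int], Tuple[object, int]]


def parse_delimited(tokens: List[Token], pos: int, subrule: Subrule) -> Tuple[object, int]:
    """Match an opening quote, hand the content to subrule, then require the same quote."""
    if pos >= len(tokens) or tokens[pos][0] not in ("DQUOTE", "QUOTE"):
        raise SyntaxError("expected an opening quote token")
    delim = tokens[pos][0]
    node, pos = subrule(tokens, pos + 1)
    if pos >= len(tokens) or tokens[pos][0] != delim:
        raise SyntaxError("expected closing " + delim)
    return node, pos + 1


def chars_subrule(tokens: List[Token], pos: int) -> Tuple[object, int]:
    """Hypothetical stand-in for mySubrule: gathers consecutive CHAR tokens into one node."""
    start = pos
    while pos < len(tokens) and tokens[pos][0] == "CHAR":
        pos += 1
    return ("chars", "".join(text for _, text in tokens[start:pos])), pos


if __name__ == "__main__":
    sample = [("DQUOTE", '"'), ("CHAR", "x"), ("CHAR", "y"), ("DQUOTE", '"')]
    print(parse_delimited(sample, 0, chars_subrule))
```

Because the quotes are ordinary tokens, the same token stream can feed either this delimited form or a plain string-literal rule, which is the point of moving the decision up into the parser.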


Voila!
