How can I handle languages where keywords can also be identifiers?

Monty Zukowski

> From: kearneym@piercom.ie
> How can I handle the strings which can be token and identifier at the
> same time.
> I have been writting the parser for the SQL, 
> and just found out that SQL has no reserved word  
> which meens that keywords can be used as ordinary identifiers.
> The first way that come acrros in my head is to use wild card 
> like below.
>     dropStatement
>         : "drop"     ("alias" | "event" "monitor")  .
>         ;
> However, if I replace all the IDENTIFIER with wild card,
> it will create ambignity warnings.

Of course you realize that this is genuinely ambiguous, right? There are two issues involved in your approach:

  1. Understanding the reasons for the ambiguities enough to know that your rules work the way they should.
  2. Understanding the code generated by antlr and knowing why it is generated and how it works.

#2 is actually easier because antlr's code is very readable and easy to follow, except for maybe the tree building stuff. #1 requires understanding lookahead and how the parser makes decisions about which alternative to take. This is explained in any text on parsing that covers LL(k) parsing, and more details are in Ter's thesis available from antlr.org.

I wrote a parser for AREV R/BASIC which had no keywords either. It had tons of ambiguity warnings but worked fine. Antlr can do it with some help. Note that the wildcard approach is a little too broad. You don't actually want to accept numbers or delimiters, etc., only literals.

Here's what I would suggest, and I'd love to hear if it works right or not and help refine the approach. Basically keep track of which tokens came from the ID rule and use that in the parser on ambiguous constructs. So:

Create your own CommonToken subclass that has an attribute called "possibleID". In your lexer define an instance variable called possibleID. Then when you create a literal like so:

    ID options { testLiterals = true; }
    	: ('a'..'z'|'A'..'Z') ('a'..'z'|'A'..'Z'|'0'..'9'|'_'|'$')*
		{ possibleID = true; } //sets flag in lexer for makeToken to

In your makeToken() method, copy the possibleID value from the lexer to the new token, and reset the lexer's possibleID to false. Then in your parser you can write things like:

        : "drop"     ("alias" | "event" "monitor")  
	    ( ID 
	    | ( ~(ID) )=>{ LT(1).possibleID() }? .

Let's look at the generated code for that last part with my comments added:

if ((LA(1)==ID)) { #the easy case, found a real ID
else {
    boolean synPredMatched6 = false;
    #test that it's not an ID and...
    if (((((LA(1) >= LITERAL_drop && LA(1) = LITERAL_monitor)))
        &&( LT(1).possibleID() ))) { #that it's a possibleID
        int _m6 = mark();
        synPredMatched6 = true;
        try {
            match(_tokenSet_0); #if so then match anything

Some things to note:

  1. I use possibleID but it's not the only way to go, you could also manually write a rule that contains _all_ of the literals in it. This is actually what I did for AREV but it was a real headache.
  2. Antlr, unlike PCCTS 1.3x, does not "hoist" syntactic or semantic predicates. That means that you can't isolate that part into a rule like this:
        ( ID 
        | ( ~(ID) )=>{ LT(1).possibleID() }? .

    The lookahead for this rule would be any token and give you the same ambiguity problems as using the wildcard operator ".".

  3. Once we know that the token is not an ID and that it is a possibleID then we know we are safe to match it with the wildcard operator. Numbers or delimiters or what-have-you will still throw exceptions if they come after your "drop" "alias" tokens.

I would truly love to come up with a sane, methodical approach to handling languages without reserved keywords. Let me know how this works so we can refine it.