Struggling with ambiguity inferred by lexing/parsing :( Why doesn't my lexer recognize a token that is a superstring of another token?

Posted By:   valentin_tihomirov
Posted On:   Wednesday, December 22, 2004 08:43 AM


The question is about handling languages where keywords can also be identifiers. Separating a grammar into lexer and parser parts has its advantages, but it turns a perfectly valid CF grammar into an ambiguous one. For example, here is a language in LISP notation:

			
plus(start): LPAREN "plus" a:ID b:ID RPAREN;
ID: ('a'..'z')+;
LPAREN: '(';
RPAREN: ')';

The input '(plus plus A)' becomes ambiguous in the lexer. This prevents using the name "plus" as an identifier, mixing language specifics (keywords) into the semantic level (variable names). Another option would be to join tag/element names with the opening symbol LPAREN.
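To make the ambiguity concrete, here is a minimal Python sketch (not ANTLR itself) of a lexer that classifies words without any parser context. The token name KW_PLUS and the helper functions are hypothetical, introduced only for illustration, and a lowercase operand is used because ID covers only 'a'..'z'.

```python
import re

# Hypothetical token names; KW_PLUS stands in for the "plus" literal.
TOKEN_SPEC = [
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("KW_PLUS", r"plus\b"),  # tried before ID, regardless of position
    ("ID", r"[a-z]+"),
    ("WS", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (type, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        if m.lastgroup != "WS":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


def matches_plus_rule(tokens):
    """Check the parser rule: LPAREN "plus" ID ID RPAREN."""
    expected = ["LPAREN", "KW_PLUS", "ID", "ID", "RPAREN"]
    return [kind for kind, _ in tokens] == expected


if __name__ == "__main__":
    print(tokenize("(plus plus a)"))
    print(matches_plus_rule(tokenize("(plus plus a)")))
    print(matches_plus_rule(tokenize("(plus a b)")))
```

Because the lexer decides on its own, the second "plus" is also typed as a keyword, so the rule that expects an ID in that position rejects the input even though the context-free grammar allows it.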
			
plus: "(plus" a:ID b:ID RPAREN;
OHTML: " "
CHTML: " "

This is how the example HTML grammar is given on the ANTLR site. This approach seems well-grounded as no identifier can start with LPAREN (for instance, "(plus" and "
			
bypass: LPAREN (ID | bypass)* RPAREN;

rule needed to traverse optional elements which should be ignored. Here is how the problem arises. The plus grammar above expects the starting symbol "(plus" at the beginning, and I do provide it in the input string (plus 10 15); however, the lexer returns an LPAREN token, which makes parsing fail. This is definitely a mistake, as the grammar cannot produce any sentence starting with LPAREN; they all start with the "(plus" literal. The lexer uses k=10. What is the recommended workaround to enable keyword-identifiers for a LISP-like language? Can the parser and lexer be used together to describe a single grammar, preventing the ambiguity from appearing? Thanks.
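One direction that is often suggested for keyword-identifiers, sketched here in Python rather than ANTLR and only as an assumption about what a workaround could look like, is to let the lexer return every word as ID and have the parser check the token text in the one position where a keyword is required. The function names below are hypothetical.

```python
import re

# The lexer knows only three token types; "plus" is an ordinary ID to it.
TOKEN = re.compile(r"\s*(?:(?P<LPAREN>\()|(?P<RPAREN>\))|(?P<ID>[a-z]+))")


def tokenize(text):
    """Return a list of (type, text) pairs; words are always typed as ID."""
    tokens = []
    pos = 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected input at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


def parse_plus(tokens):
    """Parse LPAREN ID ID ID RPAREN, where the first ID must read "plus".

    Returns the two operand names.
    """
    kinds = [kind for kind, _ in tokens]
    if kinds != ["LPAREN", "ID", "ID", "ID", "RPAREN"]:
        raise SyntaxError("expected: ( plus ID ID )")
    # The keyword check happens here, in the parser, by position.
    if tokens[1][1] != "plus":
        raise SyntaxError(f"expected 'plus', found {tokens[1][1]!r}")
    return tokens[2][1], tokens[3][1]


if __name__ == "__main__":
    print(parse_plus(tokenize("(plus plus a)")))
```

With this split, "plus" is reserved only directly after the opening parenthesis and remains usable as a variable name elsewhere. Whether a given ANTLR version can express the same check conveniently is a separate question that this sketch does not settle.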