I'm writing a parser for a new language, and I want to be able to skip over code I haven't defined yet.

Posted By:   Tim_McKenzie
Posted On:   Thursday, January 17, 2002 03:53 PM


I'd like to parse a complete file of the language I'm interested in, without a completed parser. This way I can test the parts I have done incrementally.


Where I come to a line I haven't yet made a completed rule for, I use a rule in the parser like:


dummy_rule: //scarf until semicolon is found
    ( options { greedy=false; } :
        . //match any lexer rule
    )* //non-greedy == break out of the loop if possible
    SEMICOLON_TOKEN
    ;


So the parser should simply skip every token it matches until it finds a semicolon (defined in the lexer).


In the lexer, I have a rule defining an identifier:



IDENTIFIER
    // check tokens{} section (and any other parser string) before applying this rule
    options {testLiterals=true;}
    :
    ('a'..'z'|'A'..'Z')
    //no two LOW_LINE_TOKENS in a row
    ( ('_')? ('a'..'z'|'A'..'Z'|'0'..'9') )*
    ;


However, when I parse input text containing an apostrophe, which is also defined as a token:


PERCENT_SIGN_TOKEN : '%';
AMPERSAND_TOKEN : '&';
APOSTROPHE_TOKEN : '\'';
LEFT_PARENTHESIS_TOKEN : '(';
RIGHT_PARENTHESIS_TOKEN : ')';


I get a run-time error:


parser exception: antlr.TokenStreamRecognitionException: unexpected char: '
antlr.TokenStreamRecognitionException: unexpected char: '
    at rp.RosettaLexer.nextToken(RosettaLexer.java:533)
    at antlr.TokenBuffer.fill(TokenBuffer.java:61)
    at antlr.TokenBuffer.LA(TokenBuffer.java:70)
    at antlr.LLkParser.LA(LLkParser.java:50)
    at rp.RosettaParser.facet_term(RosettaParser.java:565)
    at rp.RosettaParser.facet_definition(RosettaParser.java:371)
    at rp.RosettaParser.facet_declaration(RosettaParser.java:178)
    at rp.RosettaParser.design_unit(RosettaParser.java:146)
    at rp.RosettaParser.design_file(RosettaParser.java:122)
    at rp.RosettaParser.parseFile(RosettaParser.java:81)
    at rp.RosettaParser.doFile(RosettaParser.java:67)
    at rp.RosettaParser.doFile(RosettaParser.java:59)
    at rp.RosettaParser.main(RosettaParser.java:39)


I also have a rule defining an apostrophe-delimited character (for example 'g'), and if I alter the input so that apostrophes appear only in this form, there are no problems.


QUESTION: Why can't I skip single apostrophes with the '.' wildcard?


Thanks


Tim


Re: I'm writing a parser for a new language, and I want to be able to skip over code I haven't defined yet.

Posted By:   Monty_Zukowski  
Posted On:   Friday, January 18, 2002 06:55 AM

Your parser is not at fault; your lexer is. The exception is thrown from the lexer's nextToken(), before the parser's '.' wildcard ever sees a token. You need to refactor your rule for "'g'" so that it also covers the common prefix "'", folding the lone apostrophe and the apostrophe-delimited character into one rule. I'm embarrassed to say this isn't covered in the lexer documentation. Take a look at the rule for numbers in the Java or C grammar. It folds in ints, floats and a plain '.' because they share a common prefix which cannot be disambiguated by increasing "k", the lookahead depth.
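
As a rough illustration of the "fold the common prefix into one rule" idea, here is a small hand-written Java sketch. It is not ANTLR output and not the actual Rosetta lexer: the class name ApostropheSketch, the method tokenAt and the token name CHAR_LITERAL are made up for the example, and only APOSTROPHE_TOKEN comes from the grammar above. The point is that a single routine consumes the leading "'" and only then decides which token it has.

```java
// Hypothetical hand-written sketch; not ANTLR-generated code.
class ApostropheSketch {
    /**
     * Decide which token starts at position pos, where src.charAt(pos) is
     * an apostrophe. Both token kinds share the prefix "'", so one routine
     * handles both and decides only after the prefix has been seen.
     */
    static String tokenAt(String src, int pos) {
        if (src.charAt(pos) != '\'') {
            throw new IllegalArgumentException("expected apostrophe at " + pos);
        }
        // 'x' (any single character followed by a closing apostrophe)
        // is treated as a character literal; CHAR_LITERAL is an assumed name.
        boolean closes = pos + 2 < src.length() && src.charAt(pos + 2) == '\'';
        return closes ? "CHAR_LITERAL" : "APOSTROPHE_TOKEN";
    }

    public static void main(String[] args) {
        System.out.println(tokenAt("'g'", 0));  // prints CHAR_LITERAL
        System.out.println(tokenAt("x' y", 1)); // prints APOSTROPHE_TOKEN
    }
}
```

In the grammar itself the same decision would live inside one lexer rule that starts with '\'' and sets the token type depending on what follows.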


By the way, I would strongly recommend a unit-test approach to getting your parser off the ground. See http://groups.yahoo.com/group/antlr-interest/files/tester.zip for a toolkit to help.


My first step would be to complete the lexer. You can lex your entire program by instantiating your lexer and calling nextToken() until you get EOF.
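
The shape of that loop can be sketched as follows. To keep the example self-contained, it uses a tiny stand-in lexer rather than a generated one: LexAllSketch, lexAll and the string token types are assumptions for illustration, and the identifier rule is simplified (no underscores). With a real ANTLR-generated lexer the loop would call its nextToken() and stop on the EOF token type instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a generated lexer; only the loop shape matters.
class LexAllSketch {
    static final String EOF = "EOF";
    private final String src;
    private int pos = 0;

    LexAllSketch(String src) { this.src = src; }

    /** Return the next token's type name, or EOF once input is exhausted. */
    String nextToken() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        if (pos >= src.length()) return EOF;
        char c = src.charAt(pos);
        if (Character.isLetter(c)) {
            // Simplified identifier: a letter followed by letters or digits.
            while (pos < src.length() && Character.isLetterOrDigit(src.charAt(pos))) pos++;
            return "IDENTIFIER";
        }
        pos++;
        switch (c) {
            case ';': return "SEMICOLON_TOKEN";
            case '%': return "PERCENT_SIGN_TOKEN";
            case '&': return "AMPERSAND_TOKEN";
            default: throw new IllegalStateException("unexpected char: " + c);
        }
    }

    /** Drain the lexer, collecting every token type up to (not including) EOF. */
    static List<String> lexAll(String input) {
        LexAllSketch lexer = new LexAllSketch(input);
        List<String> types = new ArrayList<>();
        for (String t = lexer.nextToken(); !t.equals(EOF); t = lexer.nextToken()) {
            types.add(t);
        }
        return types;
    }

    public static void main(String[] args) {
        // prints [IDENTIFIER, AMPERSAND_TOKEN, IDENTIFIER, SEMICOLON_TOKEN]
        System.out.println(lexAll("foo & bar;"));
    }
}
```

Running a whole source file through a loop like this surfaces "unexpected char" errors from the lexer on their own, before any parser rule is involved.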


For the parser I would write tests for individual rules as I write the rules. The difficulty in writing a parser is that the rules interact with one another, and getting a lone rule correct doesn't mean it will work properly in the grammar as a whole. Once you start tweaking to eliminate ambiguities you will want to have unit tests to make sure you don't mess up the basics.


When I code ANTLR grammars I check in the grammar with every compile. It's notoriously easy to change too much at once, get a ton of problems, and then forget what you recently changed. By checking in frequently I can easily backtrack to what was working an hour ago.


Have fun.
