NEWBIE Question: Trying to loose parse HTML. How do I get the Parser and Lexer to interact?

Posted By:   Colin_MacLeod
Posted On:   Friday, February 1, 2002 07:50 AM

I'm trying to loose parse HTML, so that I can re-format it.



I would like to parse the HTML as a sequence of tags, comments and text, and retain all of these. I don't want to match opening and closing tags, but I do want to parse the attributes within tags.



Not being a parser buff, I imagine I'd like to create something like a linked list of "Tag" objects, with the attributes as values set in a java.util.Properties instance in each tag. There could be another class for comments (in the same linked list), and a third for unformatted text.



I could then make a walker class which goes through the list and makes sense of it all, spitting it out again nice and tidy.
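The list model and walker described above might look roughly like the sketch below. All class names here (HtmlNode, TagNode, CommentNode, TextNode, HtmlWalker) are hypothetical and are not part of ANTLR or of the grammar that follows. The "tidy" output rule (lower-case tag names, sorted attributes, one node per line) is only an assumption to keep the example small.

```java
import java.util.List;
import java.util.Properties;
import java.util.TreeSet;

// Hypothetical node types for the flat list; none of these names come from ANTLR.
abstract class HtmlNode {
    abstract String render();
}

class TagNode extends HtmlNode {
    final String name;
    final boolean closing;
    final Properties attributes = new Properties();

    TagNode(String name, boolean closing) {
        this.name = name;
        this.closing = closing;
    }

    @Override
    String render() {
        StringBuilder sb = new StringBuilder("<");
        if (closing) {
            sb.append('/');
        }
        sb.append(name.toLowerCase());
        // Sort the attribute names so the output order is deterministic.
        for (String key : new TreeSet<String>(attributes.stringPropertyNames())) {
            sb.append(' ').append(key)
              .append("=\"").append(attributes.getProperty(key)).append('"');
        }
        return sb.append('>').toString();
    }
}

class CommentNode extends HtmlNode {
    final String text;
    CommentNode(String text) { this.text = text; }
    @Override String render() { return "<!--" + text + "-->"; }
}

class TextNode extends HtmlNode {
    final String text;
    TextNode(String text) { this.text = text; }
    @Override String render() { return text.trim(); }
}

class HtmlWalker {
    // Walks the flat list in order and emits one non-empty node per line.
    static String format(List<HtmlNode> nodes) {
        StringBuilder out = new StringBuilder();
        for (HtmlNode node : nodes) {
            String s = node.render();
            if (!s.isEmpty()) {
                out.append(s).append('\n');
            }
        }
        return out.toString();
    }
}
```

The list itself can be a plain java.util.LinkedList of HtmlNode. The walker never looks at nesting, so unmatched tags pass through untouched.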



So much for my imagination. I looked long and hard at the HTML grammar which comes with ANTLR, and used it and all the documentation I could get my hands on to put together the code below.



The lexer now accepts my input nicely and passes a stream of tag or text tokens to the parser. However, since I used protected rules for the attributes, how can I 'see' them in the parser? Is it possible to use this to construct the kind of list model described above? Or am I barking up the proverbial wrong tree completely?



			
header {
//package com.ivata.intranet.Html.parser;
}

// Import the necessary classes
{
import java.io.*;
import antlr.RecognitionException;
import antlr.TokenStreamException;
}


//-----------------------------------------------------------------------------
// Define a Parser, calling it HtmlParser
//-----------------------------------------------------------------------------
class HtmlParser extends Parser;
options {
// tokenVocabulary=Html; // Call its vocabulary "Html"
// codeGenMakeSwitchThreshold = 2; // Some optimizations
// codeGenBitsetTestThreshold = 3;
defaultErrorHandler = true; // Generate the default parser error handlers
k=1;
// buildAST = true; // uses CommonAST by default
}


// Define some methods and variables to use in the generated parser.
{
public static void main(String[] args) {
try {
System.out.println( "Please enter your test..." );
HtmlLexer lexer = new HtmlLexer(new DataInputStream(System.in));
lexer.setFilename(" ");
HtmlParser parser = new HtmlParser(lexer);
parser.setFilename(" ");
// Parse the input expression
parser.document( );
}
catch(TokenStreamException e) {
System.err.println("exception: "+e);
}
catch(RecognitionException e) {
System.err.println("exception: "+e);
}
}
}


// for our purposes, a document is just a series of tags, comments or text
// it is kept this flexible to allow any strange combination:
// the tree walker will make sense of it all later
document
:
(tag|comment|text)*
// end-of-file
;

tag
:
OPEN_TAG | CLOSED_TAG
;

comment
:
COMMENT
;

text
:
PCDATA
;

//----------------------------------------------------------------------------
// The Html scanner
//----------------------------------------------------------------------------
class HtmlLexer extends Lexer;



options {
charVocabulary = '\0'..'\377';
// tokenVocabulary=Html; // call the vocabulary "Html"
testLiterals=false; // don't automatically test for literals
k=2;
}


tokens {
OPEN_TAG; CLOSED_TAG; EMPTY_TAG;
}

protected
HEX_NUMBER
: '#' HEX_INTEGER
;

protected
NUMBER
:
('-')? INTEGER
;

protected
INTEGER
: ('0'..'9')+
;

protected
HEX_INTEGER
: (
/* Technically, HEXINT cannot be followed by a..f, but due to our
loose grammar, the whitespace that normally would follow this
rule is optional. ANTLR reports that #4FACE could parse as
HEXINT "#4" followed by WORD "FACE", which is clearly bogus.
ANTLR does the right thing by consuming as much input as
possible here. I shut the warning off.
*/
options {
generateAmbigWarnings=false;
}
: '0'..'9'
| 'a'..'f'
| 'A'..'F'
)+
;

protected
STRING
: '"' (~'"')* '"'
| '\'' (~'\'')* '\''
;


protected
IDENTIFIER
: (
('a'..'z'|'A'..'Z')
| '.'
)

(
/* In reality, a WORD must be followed by whitespace, '=', or
what can follow an ATTR such as '>'. In writing this grammar,
however, we just list all the possibilities as optional
elements. This is loose, allowing the case where nothing is
matched after a WORD and then the (ATTR)* loop means the
grammar would allow "widthheight" as WORD WORD or WORD, hence,
an ambiguity. Naturally, ANTLR will consume the input as soon
as possible, combining "widthheight" into one WORD.

I am shutting off the ambiguity here because ANTLR does the
right thing. The exit path is ambiguous with every
alternative. The only solution would be to write an unnatural
grammar (lots of extra productions) that laid out the
possibilities explicitly, preventing the bogus WORD followed
immediately by WORD without whitespace etc...
*/
options {
generateAmbigWarnings=false;
}
: ('a'..'z'|'A'..'Z')
| ('0'..'9')
| '.'
)*
;


protected TAG_START : '<' ;
protected TAG_END : '>' ;
protected TAG_CLOSE : '/' ;


// multiple-line comments
protected
COMMENT_DATA
: ( /* '\r' '\n' can be matched in one alternative or by matching
'\r' in one iteration and '\n' in another. I am trying to
handle any flavor of newline that comes in, but the language
that allows both "\r\n" and "\r" and "\n" to all be valid
newline is ambiguous. Consequently, the resulting grammar
must be ambiguous. I'm shutting this warning off.
*/
options {
generateAmbigWarnings=false;
}
:
{!(LA(2)=='-' && LA(3)=='>')}? '-' // allow '-' if not "-->"
| '\r' '\n' {newline();}
| '\r' {newline();}
| '\n' {newline();}
| ~('-'|'\n'|'\r')
)*
;

COMMENT
:
"<!--"! COMMENT_DATA "-->"! (WS)?
;
// ignore all white space
WS : (
/* '\r' '\n' can be matched in one alternative or by matching
'\r' in one iteration and '\n' in another. I am trying to
handle any flavor of newline that comes in, but the language
that allows both "\r\n" and "\r" and "\n" to all be valid
newline is ambiguous. Consequently, the resulting grammar
must be ambiguous. I'm shutting this warning off.
*/
options {
generateAmbigWarnings=false;
}
: ' '
| '\t'
| '\n' { newline(); }
| "\r\n" { newline(); }
| '\r' { newline(); }
)+
{ $setType(Token.SKIP); }
;

protected
ATTRIBUTE_VALUE
:
'='! (WS)? (IDENTIFIER ('%')? | ('-')? INTEGER | STRING | HEX_NUMBER) (WS)?
;
protected ATTRIBUTE
:
a:IDENTIFIER
( ATTRIBUTE_VALUE )?
;

OPEN_TAG
:
t:TAG_START (WS)? IDENTIFIER (WS (ATTRIBUTE)*)? ( TAG_CLOSE )? TAG_END
;

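// closing tag such as </b>; the parser's tag rule refers to this token as CLOSED_TAG
CLOSED_TAG
:
t:TAG_START TAG_CLOSE (WS)? IDENTIFIER (WS)? TAG_END
;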

PCDATA
options {testLiterals=true;}
: (
/* See comment in WS. Language for combining any flavor
* newline is ambiguous. Shutting off the warning.
*/
options {
generateAmbigWarnings=false;
}
: ~('<'|'\n'|'\r'|'"'|' '|'\t'|'>')
)+
;

Re: NEWBIE Question: Trying to loose parse HTML. How do I get the Parser and Lexer to interact?

Posted By:   Terence_Parr  
Posted On:   Saturday, February 2, 2002 09:58 AM

Your "imagination"/intuition is correct. You want to have the lexer match a bunch of tags and text, and then have a simple program check for balance and/or do formatting. The HTML sanitizer that is accepting the text of this response does precisely that.
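As a rough illustration of the "check for balance" step, here is a minimal stack-based sketch. It assumes the tags have already been reduced to a list of names, with a leading "/" marking a closing tag. The class name BalanceChecker and that string convention are made up for this example. It does not special-case elements such as br or img that never take a closing tag.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class BalanceChecker {
    // Each entry is a tag name; a leading '/' marks a closing tag, e.g. "b" and "/b".
    static boolean isBalanced(List<String> tags) {
        Deque<String> open = new ArrayDeque<String>();
        for (String tag : tags) {
            if (tag.startsWith("/")) {
                String name = tag.substring(1);
                // A closing tag must match the most recently opened tag.
                if (open.isEmpty() || !open.pop().equalsIgnoreCase(name)) {
                    return false;
                }
            } else {
                open.push(tag);
            }
        }
        // Anything still on the stack was opened but never closed.
        return open.isEmpty();
    }
}
```

A formatter that only re-indents or re-cases tags can skip this check entirely and work from the flat token sequence.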



The text matched by protected lexer rules is typically included in the token that invokes them. If you print out the text of a complicated token, you should see all the attributes, unless you explicitly tell the lexer with "!" to toss out text.
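One way to use that is to take the full text of a tag token (in ANTLR 2 that would be the string returned by the token's getText()) and pull the attributes out of it afterwards. The sketch below works on a plain string. It assumes the "=" signs are still present in the text, so the "!" after '=' in ATTRIBUTE_VALUE would need to be removed. The class TagTextParser and its regular expressions are illustrative only and loosely mirror the IDENTIFIER and STRING rules from the question.

```java
import java.util.Properties;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextParser {
    // "<", an optional "/", then the element name.
    private static final Pattern HEAD =
        Pattern.compile("^<\\s*/?\\s*([A-Za-z.][A-Za-z0-9.]*)");

    // A name, optionally followed by = and a "double", 'single' or bare value.
    private static final Pattern ATTR = Pattern.compile(
        "([A-Za-z.][A-Za-z0-9.]*)(?:\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+)))?");

    // Returns the element name of a tag such as <a href="x"> or </a>.
    static String tagName(String tagText) {
        Matcher m = HEAD.matcher(tagText);
        return m.find() ? m.group(1) : "";
    }

    // Collects the attributes of a tag into a Properties instance.
    static Properties attributes(String tagText) {
        Properties props = new Properties();
        Matcher head = HEAD.matcher(tagText);
        int start = head.find() ? head.end() : 0;
        Matcher m = ATTR.matcher(tagText);
        m.region(start, tagText.length()); // skip the element name itself
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4) != null ? m.group(4)
                         : ""; // attribute without a value, e.g. nowrap
            props.setProperty(m.group(1), value);
        }
        return props;
    }
}
```

For example, TagTextParser.attributes("<td width=\"100%\" align=center nowrap>") yields width, align and nowrap, the last with an empty value. Attribute names containing characters outside the IDENTIFIER rule (a hyphen, say) are not recognised by this sketch.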