Can I read the HTML document in Java?
6 posts in topic

Posted By:   Manujender_Reddy
Posted On:   Tuesday, June 19, 2001 03:07 PM

I have a situation where I need to read the TABLE data from an HTML file.

Is it possible to do that in Java?

Re: Can I read the HTML document in Java?

Posted By:   Anonymous  
Posted On:   Thursday, August 5, 2010 02:44 AM

Use HtmlUnit for this. It is very easy to use and a good approach.

Re: Can I read the HTML document in Java?

Posted By:   Anonymous  
Posted On:   Tuesday, April 6, 2010 12:59 PM

Write a simple string parser. Open the HTML document, wherever the file is located, and read its contents into a string with Java's file processing classes. Then parse the string by
searching for the table tags and scanning the row contents into tokens. Store each value in a suitable data structure, e.g. a java.util.ArrayList object, and use it as you wish.
Thanks.
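
A minimal sketch of the string-parsing idea above, assuming the table markup is simple: every row and cell has an explicit closing tag and tables are not nested. The class name TableStringParser and the method extractRows are made up for illustration. Regular expressions stand in for hand-written token scanning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TableStringParser {
    // Non-greedy patterns; DOTALL lets a row or cell span several lines.
    private static final Pattern ROW = Pattern.compile(
            "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern CELL = Pattern.compile(
            "<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns one list of cell texts per <tr> found in the HTML string. */
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop any inline markup left inside the cell, then trim.
                cells.add(cell.group(1).replaceAll("<[^>]+>", "").trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>1</td><td>one</td></tr>"
                    + "<tr><td>2</td><td><b>two</b></td></tr></table>";
        System.out.println(extractRows(html)); // [[1, one], [2, two]]
    }
}
```

This breaks down on nested tables, omitted end tags, or markup inside comments and scripts. For such pages a real HTML parser is the safer choice.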

Re: Can I read the HTML document in Java?

Posted By:   Will_Yates  
Posted On:   Thursday, March 25, 2004 08:10 AM

Hi. Yes, it is possible to read the table data. Use a FileReader object to read the file, then parse the strings as you read them to locate the table tags.
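
A rough sketch of that read-and-scan loop, assuming the opening and closing table tags each sit on a line of their own text and that there is one table at a time. The names TableLineReader and readTableLines are hypothetical. The method takes any Reader, so a FileReader can be passed for a real file.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class TableLineReader {
    /** Collects the lines from the opening <table> tag through </table>. */
    static List<String> readTableLines(Reader source) {
        List<String> tableLines = new ArrayList<>();
        boolean inTable = false;
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                String lower = line.toLowerCase(Locale.ROOT);
                if (lower.contains("<table")) {
                    inTable = true;            // table starts on this line
                }
                if (inTable) {
                    tableLines.add(line.trim());
                }
                if (lower.contains("</table>")) {
                    inTable = false;           // table ends on this line
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return tableLines;
    }

    public static void main(String[] args) throws IOException {
        // Pass a file name as the first argument, or fall back to a sample.
        Reader source = args.length > 0
                ? new FileReader(args[0])
                : new StringReader("<p>intro</p>\n<table>\n<tr><td>x</td></tr>\n</table>\n");
        readTableLines(source).forEach(System.out::println);
    }
}
```

The collected lines still contain the tags, so a second pass is needed to pull the cell values out of them.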

Re: Can I read the HTML document in Java?

Posted By:   Aleksei_Valikov  
Posted On:   Thursday, January 30, 2003 05:08 AM

My ultimate advice would be to use JTidy, http://sourceforge.net/projects/jtidy.
It is a very neat tool, capable of "correcting" syntactical mistakes in HTML files and outputting them as a DOM.

After that I would use an XPath package to retrieve the required part of the document.

I thought JTidy was the de facto standard for tasks like this, but no one mentioned it.
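
A sketch of the XPath step only, using the javax.xml.xpath API that ships with the JDK. The JTidy call is deliberately left out: parseWellFormed is a hypothetical stand-in that builds an org.w3c.dom.Document from markup that is already well-formed, which is roughly what the clean-up step is expected to hand over. The class and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class TableXPath {
    /** Evaluates an XPath expression and returns the text of each matching node. */
    static List<String> select(Document dom, String expression) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, dom, XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent().trim());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad XPath: " + expression, e);
        }
    }

    /** Stand-in for the clean-up step: parses markup that is already well-formed. */
    static Document parseWellFormed(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("markup is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        Document dom = parseWellFormed(
                "<html><body><table><tr><td>a</td><td>b</td></tr>"
              + "<tr><td>c</td></tr></table></body></html>");
        System.out.println(select(dom, "//table//td")); // [a, b, c]
    }
}
```

An expression such as //tr[2]/td narrows the result to the cells of the second row, which is where XPath pays off compared with walking the DOM by hand.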

Re: Can I read the HTML document in Java?

Posted By:   Chandra_Patni  
Posted On:   Thursday, June 28, 2001 07:02 PM

You can use the HTML parser used by javax.swing.JEditorPane. The parser works on a callback principle, similar to a SAX parser in the XML world. All you need to do is write a callback handler by extending HTMLEditorKit.ParserCallback, a nested class in the javax.swing.text.html package. A typical use case would be as follows. I have provided a skeleton of a handler with some useful comments.


import java.io.Reader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

Reader reader = ... // from your source file, url etc
// ParserDelegator loads the default HTML DTD for you. A bare
// DTD.getDTD("html") may come back without the element definitions,
// in which case TABLE, TR and TD are not reported as start/end tags.
new ParserDelegator().parse(reader, new CallbackHandler(), true);

....
....

class CallbackHandler extends HTMLEditorKit.ParserCallback {

    /** Called by the parser when text is encountered. */
    @Override
    public void handleText(char[] data, int pos) {
        // deal with the text here...
    }

    /** Called for empty tags such as BR or IMG, which have no end tag. */
    @Override
    public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // do something useful
    }

    /** Called by the parser when the start of a tag is encountered. */
    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        // TABLE, TR, TD and P arrive here once the default DTD is loaded
        if (t == HTML.Tag.TR) {
            // a new row starts
            return;
        }
        if (t == HTML.Tag.TD) {
            // a new cell starts
            return;
        }
    }

    /** Called by the parser when the end of a tag is encountered. */
    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        // e.g. close the current row when t == HTML.Tag.TR
    }
}
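
For reference, a self-contained sketch of the same callback approach that collects the cell text of each table row. It assumes javax.swing.text.html.parser.ParserDelegator is used to drive the parser, since it loads the default HTML DTD itself, and that rows and cells have explicit closing tags. The class name SwingTableReader and the method readRows are made up for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class SwingTableReader extends HTMLEditorKit.ParserCallback {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;    // null while outside a <tr>
    private StringBuilder currentCell;  // null while outside a <td>/<th>

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TR) {
            currentRow = new ArrayList<>();
        } else if (t == HTML.Tag.TD || t == HTML.Tag.TH) {
            currentCell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (currentCell != null) {
            currentCell.append(data);   // text may arrive in several pieces
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if ((t == HTML.Tag.TD || t == HTML.Tag.TH) && currentCell != null) {
            if (currentRow != null) {
                currentRow.add(currentCell.toString().trim());
            }
            currentCell = null;
        } else if (t == HTML.Tag.TR && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    /** Parses the HTML and returns one list of cell texts per table row. */
    static List<List<String>> readRows(Reader html) {
        SwingTableReader handler = new SwingTableReader();
        try {
            // true = ignore charset declarations inside the document
            new ParserDelegator().parse(html, handler, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return handler.rows;
    }

    public static void main(String[] args) {
        String html = "<html><body><table><tr><td>1</td><td>one</td></tr>"
                    + "<tr><td>2</td><td>two</td></tr></table></body></html>";
        System.out.println(readRows(new StringReader(html)));
    }
}
```

Keeping the row and cell state in the handler is what turns the flat stream of callbacks back into a table structure. Nested tables would need a stack instead of the two fields used here.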

Re: Can I read the HTML document in Java

Posted By:   daniel_walker  
Posted On:   Wednesday, June 20, 2001 04:00 PM

Yes.

You could either parse the HTML document yourself or use one of the many XML parsers. The second would probably make life a little easier.
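
A sketch of the XML-parser route using the SAX parser bundled with the JDK. The important assumption is that the page is well-formed XHTML: an XML parser rejects ordinary HTML with unclosed tags such as a bare <br>. The names XhtmlCellReader and readCells are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class XhtmlCellReader extends DefaultHandler {
    private final List<String> cells = new ArrayList<>();
    private StringBuilder current; // non-null only while inside a <td>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("td".equalsIgnoreCase(qName)) {
            current = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (current != null) {
            current.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("td".equalsIgnoreCase(qName) && current != null) {
            cells.add(current.toString().trim());
            current = null;
        }
    }

    /** Returns the text of every <td>, or throws if the input is not well-formed. */
    static List<String> readCells(String xhtml) {
        XhtmlCellReader handler = new XhtmlCellReader();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        return handler.cells;
    }

    public static void main(String[] args) {
        System.out.println(readCells(
                "<table><tr><td>a</td><td>b</td></tr></table>")); // [a, b]
    }
}
```

If the input cannot be guaranteed to be well-formed, it has to be cleaned up first or read with an HTML-aware parser instead.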