I want to parse arbitrary HTML documents from the web into a generic structure that a program can work with. I was thinking of using an XML parser with the HTML DTD from the W3C. Does this sound sensible, or am I missing something?

Davanum Srinivas

Use JTidy. It can convert HTML documents into XHTML/XML:

http://lempinen.net/sami/jtidy/
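The reason a cleanup step is needed first: most HTML on the web is not well-formed XML, so a plain XML parser stops at the first unclosed tag. Here is a minimal sketch using only the JDK's built-in parser. The class and method names (StrictXmlCheck, isWellFormedXml) are made up for illustration and are not part of JTidy.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: reports whether a string parses as well-formed XML.
class StrictXmlCheck {

    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for unclosed or mismatched tags, among other things.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Typical browser-tolerated HTML: <br> and <p> are never closed.
        String looseHtml = "<html><body><p>one<br>two</body></html>";
        // The same content written as XHTML-style markup.
        String xhtml = "<html><body><p>one<br/>two</p></body></html>";

        System.out.println(isWellFormedXml(looseHtml)); // false
        System.out.println(isWellFormedXml(xhtml));     // true
    }
}
```

A tool like JTidy closes the gap by rewriting the first form into the second before you hand it to an XML parser. If I remember the API correctly, that is done with org.w3c.tidy.Tidy, setXHTML(true) and parseDOM(in, out), but check the Javadoc for the version you download.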