I want to parse arbitrary HTML documents from the web into a generic, structured form. I was thinking of using an XML parser with the HTML DTD from the W3C. Does this sound sensible, or am I missing something?
Created May 4, 2012
Davanum Srinivas: Use JTidy. It can convert HTML documents into XHTML/XML.
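To illustrate why a cleanup step is usually needed, here is a minimal sketch using only the JDK's built-in XML parser. It does not call JTidy itself. The class name `StrictXmlCheck`, the helper `isWellFormed`, and both sample strings are made up for this illustration, and the second string is a hand-written stand-in for the kind of XHTML a tidy tool would be expected to produce. A strict XML parser stops at the first well-formedness error, before any DTD comes into play, so typical web HTML with unclosed tags is rejected outright.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

public class StrictXmlCheck {

    /** Returns true if the markup parses as well-formed XML, false otherwise. */
    public static boolean isWellFormed(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print them.
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { /* ignore warnings */ }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException | IOException | ParserConfigurationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: common "tag soup" with an unclosed <br> and <p>.
        String html = "<html><body><p>First<br>Second</body></html>";
        // Hypothetical sample: the same content written as well-formed XHTML.
        String xhtml = "<html><body><p>First<br/>Second</p></body></html>";

        System.out.println(isWellFormed(html));   // prints "false"
        System.out.println(isWellFormed(xhtml));  // prints "true"
    }
}
```

Once the markup is well-formed, the usual XML tooling (DOM, XPath, XSLT) can be applied to it.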