Saturday, October 27, 2001 07:43 PM
It is hard to say what the best or recommended way of indexing HTML is; it really depends on what you want to achieve.
To extract text and different types of elements from HTML, you can use JTidy, which can be found on SourceForge.
Lucene also comes with a demo that indexes HTML, which you can take a look at.
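
To make the extraction step a bit more concrete, here is a rough sketch of pulling the title and the visible text out of a page so they can be indexed as separate fields. Note that this does not use JTidy or Lucene at all: it uses Python's standard html.parser as a stand-in, and the names FieldExtractor, extract_fields, "title" and "contents" are made up for illustration. Treat it as one possible approach, not as how the Lucene demo does it.

```python
from html.parser import HTMLParser


class FieldExtractor(HTMLParser):
    """Collects <title> text and visible text separately (hypothetical helper)."""

    # Elements whose content is not meant to be read by a person.
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.body_parts = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Ignore whitespace-only runs and anything inside script/style.
        if not text or self._skip_depth:
            return
        target = self.title_parts if self._in_title else self.body_parts
        target.append(text)


def extract_fields(html):
    """Return a dict of field name -> text, ready to hand to an indexer."""
    parser = FieldExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title_parts),
        "contents": " ".join(parser.body_parts),
    }


if __name__ == "__main__":
    page = "<html><head><title>Hello</title></head><body><p>Some text</p></body></html>"
    print(extract_fields(page))
```

Keeping the title apart from the body text is only one choice; whether you also want links, headings or meta tags as their own fields depends on what you plan to search on.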
This is not a great answer, but I hope it helps.