How can I index HTML documents?

Otis Gospodnetic

To index HTML documents, you first need to parse them and extract the text you want to index. Here are some HTML parsers that can help:

The web application demo that ships with the Lucene distribution includes an example that uses JavaCC to parse HTML into Lucene Document objects.

The CyberNeko HTML Parser (NekoHTML) lets you parse HTML documents. It is relatively easy to strip most (or all) of the tags from an HTML document and then use the tags you kept to help create metadata for your Lucene document. NekoHTML also provides a DOM interface for navigating through the HTML.

JTidy cleans up HTML and can expose the cleaned-up document as a DOM through its Java API.
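
Whichever parser you choose, the extraction step has the same shape: walk the HTML, keep the text you care about, and drop the markup. The following is only a minimal sketch of that step. It does not use any of the parsers listed above; as a stand-in it uses the HTML parser bundled with the JDK (javax.swing.text.html), so that it runs without extra libraries. The class name HtmlTextExtractor and the title/body split are assumptions made for illustration. The two resulting strings are what you would then add as fields to a Lucene Document.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: pulls the title and the visible text out of an HTML string.
class HtmlTextExtractor {

    /** The two pieces of text that would be handed to the indexer. */
    static final class Result {
        final String title;
        final String body;

        Result(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    static Result extract(String html) {
        final StringBuilder title = new StringBuilder();
        final StringBuilder body = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;
            private boolean skipping = false; // true while inside <style> or <script>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = true;
                if (tag == HTML.Tag.STYLE || tag == HTML.Tag.SCRIPT) skipping = true;
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) inTitle = false;
                if (tag == HTML.Tag.STYLE || tag == HTML.Tag.SCRIPT) skipping = false;
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (skipping) return; // ignore stylesheet and script content
                StringBuilder target = inTitle ? title : body;
                if (target.length() > 0) target.append(' '); // separate text chunks
                target.append(data);
            }
        };

        try {
            // Third argument: ignore any charset declared inside the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new Result(title.toString(), body.toString());
    }

    public static void main(String[] args) {
        Result r = extract("<html><head><title>Demo</title></head>"
                + "<body><p>Some <b>bold</b> text.</p></body></html>");
        System.out.println("title: " + r.title);
        System.out.println("body:  " + r.body);
    }
}
```

With NekoHTML or JTidy the callback would be replaced by a walk over the DOM they produce, but the output (plain strings for the title and body) would be the same kind of input for your Lucene document.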