Indexing HTML

Posted By:   Erik_Hatcher
Posted On:   Saturday, October 27, 2001 05:53 AM

What is the best (recommended) way of indexing HTML documents with Lucene? Ideally it should pull out [...] and [...] as two separate fields and strip out HTML tags as well.

Re: Indexing HTML

Posted By:   Otis_Gospodnetic  
Posted On:   Saturday, October 27, 2001 07:43 PM

It is hard to say what the best or recommended way of indexing HTML is; it really depends on what you want to achieve.

For extracting text and different types of elements from HTML, you can use JTidy, which can be found on SourceForge.
Lucene also comes with a demo that indexes HTML, which is worth a look.
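
To give a rough idea of the extraction step, here is a minimal sketch. It does not use JTidy or the Lucene demo parser; it swaps in the HTML parser that ships with the JDK (javax.swing.text.html) so that it runs without extra jars. It assumes, purely for illustration, that the two fields you want are the page title and the body text. The names HtmlFields, Extracted and extract are made up for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: splits an HTML document into tag-free title and body text.
class HtmlFields {

    /** Plain-text title and body pulled from one HTML document. */
    static final class Extracted {
        final String title;
        final String body;

        Extracted(String title, String body) {
            this.title = title;
            this.body = body;
        }
    }

    static Extracted extract(Reader html) throws IOException {
        final StringBuilder title = new StringBuilder();
        final StringBuilder body = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean inTitle = false;
            private boolean inBody = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = true;
                } else if (tag == HTML.Tag.BODY) {
                    inBody = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.TITLE) {
                    inTitle = false;
                } else if (tag == HTML.Tag.BODY) {
                    inBody = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Only text reaches this method, so the tags are already gone.
                StringBuilder target = inTitle ? title : (inBody ? body : null);
                if (target == null) {
                    return;
                }
                if (target.length() > 0) {
                    target.append(' ');
                }
                target.append(data);
            }
        };

        // true = ignore charset declarations, since we already read from a Reader.
        new ParserDelegator().parse(html, callback, true);
        return new Extracted(title.toString().trim(), body.toString().trim());
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><head><title>My Page</title></head>"
                + "<body><p>Hello world</p></body></html>";
        Extracted e = extract(new StringReader(page));
        System.out.println("title: " + e.title);
        System.out.println("body:  " + e.body);
    }
}
```

The two strings can then be added to a Lucene Document as two separate fields. The exact Field calls depend on the Lucene version you are using, so they are left out here.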

This is not a great answer, but I hope it helps.

