I have a large document (more than one megabyte), and terms appearing at the end of the document are not appearing in the index.

Posted By:   Alex_Chaffee
Posted On:   Tuesday, April 8, 2003 07:47 AM


The IndexWriter class has a field "maxFieldLength". From the documentation:


The maximum number of terms that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory.


By default, no more than 10,000 terms will be indexed for a field.



This means that Lucene will silently discard all tokens after it has indexed 10,000 for a given field of a document.


The solution is to remove this limitation. We have had success with writer.maxFieldLength = Integer.MAX_VALUE.
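To make the effect of the limit concrete, here is a minimal sketch in plain Java. It is not Lucene code: the class FieldLimitSketch and its methods indexField and largeDocument are hypothetical names invented for illustration, and the whitespace tokenizer is a deliberate simplification. It only mimics the documented rule that at most maxFieldLength terms are indexed per field, comparing the 10,000 default with Integer.MAX_VALUE.

```java
import java.util.HashSet;
import java.util.Set;

// Toy stand-in for an indexer; NOT Lucene code. It only mimics the
// documented rule "index at most maxFieldLength terms per field".
class FieldLimitSketch {

    // Returns the distinct terms that would be indexed for a single field.
    static Set<String> indexField(String text, int maxFieldLength) {
        Set<String> indexed = new HashSet<>();
        int count = 0;
        for (String token : text.split("\\s+")) {
            if (token.isEmpty()) continue;
            if (count >= maxFieldLength) break; // remaining tokens are dropped silently
            indexed.add(token.toLowerCase());
            count++;
        }
        return indexed;
    }

    // Builds a document of `fillerTerms` filler words followed by one unique last term.
    static String largeDocument(int fillerTerms, String lastTerm) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fillerTerms; i++) sb.append("filler ");
        sb.append(lastTerm);
        return sb.toString();
    }

    public static void main(String[] args) {
        String doc = largeDocument(20_000, "zebra");
        Set<String> limited = indexField(doc, 10_000);
        Set<String> unlimited = indexField(doc, Integer.MAX_VALUE);
        System.out.println("limit 10,000 indexes 'zebra': " + limited.contains("zebra"));
        System.out.println("no limit indexes 'zebra': " + unlimited.contains("zebra"));
    }
}
```

With the 10,000-term limit the final term never reaches the index, while early terms are still found, which is why searches appear to work. With Integer.MAX_VALUE the final term is indexed as well.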



In our opinion, this is a very dangerous design flaw. Lucene truncates large documents without any warning, and searches appear to work correctly because terms from the beginning of each document are still found. It was only by chance that a customer noticed and reported that some documents were not appearing in the search results.



The default should be to index all tokens (that is, a default of Integer.MAX_VALUE), and if a user requires memory efficiency, she should be able to set a lower threshold explicitly.


Re: I have a large document (more than one megabyte) and terms appearing in the end of the document are not appearing in the index.

Posted By:   Alex_Chaffee  
Posted On:   Tuesday, April 8, 2003 12:15 PM
