Large content and Lucene

Posted By:   Daniel_Woodard
Posted On:   Friday, January 9, 2004 08:31 PM

I have been evaluating Lucene as a possible alternative to an existing full-text system.


The most recent test I did was with a collection of roughly 1.2 million documents. Both indexing and searching were fast. However, the index files are over twice the size of the actual text: 6.5 GB of text versus 13.8 GB of index files.


My question is: does this seem right? The fields I am indexing are the default fields of the demo project, i.e., modified, contents, summary, title, uid, url.


I have a collection of over 20 million documents (125 GB) that I need to make full-text searchable, and if the indexes are larger than the text files, ...


If you could shed some light that would be appreciated.
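
To put the concern in numbers, here is a rough back-of-the-envelope sketch. It assumes the index-to-text ratio from the 1.2 million document test carries over linearly to the full collection, which is only an assumption; the real ratio depends on the fields, on what is stored, and on the vocabulary. The class and method names are made up for illustration.

```java
// Hypothetical helper: projects index size from the sample figures in this thread.
class IndexSizeProjection {

    // Scales the sample's index/text ratio to a larger collection.
    // ASSUMPTION: index size grows linearly with text size.
    static double projectIndexGb(double sampleTextGb, double sampleIndexGb, double targetTextGb) {
        return targetTextGb * (sampleIndexGb / sampleTextGb);
    }

    public static void main(String[] args) {
        double sampleTextGb = 6.5;    // text size in the test
        double sampleIndexGb = 13.8;  // index size in the test
        double targetTextGb = 125.0;  // full collection

        double ratio = sampleIndexGb / sampleTextGb;
        double projected = projectIndexGb(sampleTextGb, sampleIndexGb, targetTextGb);

        System.out.printf("index/text ratio: %.2f%n", ratio);
        System.out.printf("projected index size: %.0f GB%n", projected);
    }
}
```

At the observed ratio of about 2.12, the 125 GB collection would need roughly 265 GB of index, which is why it matters whether the ratio can be brought down.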


Re: Large content and Lucene

Posted By:   Anonymous  
Posted On:   Saturday, February 14, 2004 03:31 AM

Hi,


How are you creating the contents field? With

Field.Text(String, Reader)
or
Field.Text(String, String)
?



The first one doesn't store the data that the Reader provides; the data is only tokenized and indexed. With the second one, the data is tokenized, indexed and stored. So if you create the contents field with the second method, the index also holds a full copy of the text and will be large.
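
To illustrate why storing makes such a difference, here is a small toy model in plain Java. It is not Lucene code and does not reflect Lucene's file format; the class `StoredFieldSketch` and its byte accounting (4 bytes per posting, UTF-8 bytes per term and per stored value) are assumptions made purely for illustration. It only shows that a stored field adds a verbatim copy of the text on top of the postings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: NOT Lucene's implementation or on-disk format.
class StoredFieldSketch {

    // Inverted index: term -> ids of the documents that contain it.
    private final Map<String, List<Integer>> postings = new TreeMap<>();
    // Verbatim field values, kept only when store is true.
    private final List<String> storedValues = new ArrayList<>();
    private final boolean store;
    private int nextDocId = 0;

    StoredFieldSketch(boolean store) {
        this.store = store;
    }

    // Tokenizes and indexes the text; additionally keeps a copy if storing.
    void addDocument(String contents) {
        int docId = nextDocId++;
        for (String term : contents.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            List<Integer> docs = postings.computeIfAbsent(term, k -> new ArrayList<>());
            // Record each document at most once per term.
            if (docs.isEmpty() || docs.get(docs.size() - 1) != docId) {
                docs.add(docId);
            }
        }
        if (store) {
            storedValues.add(contents);
        }
    }

    // Rough size of the postings: term bytes plus 4 bytes per document id.
    long postingsBytes() {
        long bytes = 0;
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            bytes += e.getKey().getBytes(StandardCharsets.UTF_8).length;
            bytes += 4L * e.getValue().size();
        }
        return bytes;
    }

    // Size of the verbatim copies; zero when nothing is stored.
    long storedBytes() {
        long bytes = 0;
        for (String value : storedValues) {
            bytes += value.getBytes(StandardCharsets.UTF_8).length;
        }
        return bytes;
    }

    long totalBytes() {
        return postingsBytes() + storedBytes();
    }

    public static void main(String[] args) {
        String text = "the quick brown fox jumps over the lazy dog";
        StoredFieldSketch indexedOnly = new StoredFieldSketch(false);
        StoredFieldSketch indexedAndStored = new StoredFieldSketch(true);
        indexedOnly.addDocument(text);
        indexedAndStored.addDocument(text);
        System.out.println("indexed only:       " + indexedOnly.totalBytes() + " bytes");
        System.out.println("indexed and stored: " + indexedAndStored.totalBytes() + " bytes");
    }
}
```

In this model the stored variant is always larger by exactly the size of the original text, while the postings are identical in both cases.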



If you are indexing your documents with your own code, be sure to optimize the index once indexing is finished. Optimization is done with the optimize() method of IndexWriter.



Have you read the article "Advanced Text Indexing with Lucene"? It may give you some good ideas on how to tune your indexing.


--Jouni