Does Lucene allow splitting a document's tokens between several indexes?

Posted By:   Reza_Zakeri
Posted On:   Sunday, September 6, 2009 06:14 AM


Dear All,


(1)

I'm trying to make some changes to Lucene Java to provide better support for Farsi. The problem I'm working on now is searching for both the simplified (accent-removed) and the original form of a phrase at the same time. To do this, I plan to store the original form of the file content in a primary index, and store only the accent-removed forms of accented words in a small secondary index. This way, I can use a simple index searcher for exact-form search and a multiple-index searcher for accent-removed search.
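To make the split concrete, here is a minimal plain-Java sketch of the accent-removal step, with no Lucene dependency. It assumes that "accent" means the combining diacritics used in Arabic script (Unicode category Mn, for example kasra U+0650); the class and method names are hypothetical and only illustrate which words would end up in the secondary index.

```java
import java.util.ArrayList;
import java.util.List;

public class AccentSplit {

    // Assumption: an "accent" is any combining mark (Unicode category Mn),
    // which covers the Arabic-script short-vowel signs. Base letters are kept.
    static String stripAccents(String word) {
        return word.replaceAll("\\p{Mn}", "");
    }

    // Returns the simplified form of only those words that actually carry
    // accents; unaccented words are left to the primary index alone.
    static List<String> secondaryIndexTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            String simplified = stripAccents(word);
            if (!simplified.equals(word)) {
                terms.add(simplified);
            }
        }
        return terms;
    }
}
```

Whitespace splitting stands in for a real tokenizer here; an actual analyzer would decide the word boundaries.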


Now my question is: does Lucene allow a document's tokens to be split between several indexes? And if so, can it load the different parts from those indexes into one internal document object?


(2)

I need to make some changes to the index creation process, too. I think it would be error-prone to change the whole hierarchy of calls initiated by the IndexWriter.addDocument method. It would be much easier if I could just ask the secondary index writer to store the required tokens (the simplified forms of accented words).


To do this, I first created extensions of the Token, Tokenizer and Filter classes that let me hold two string values for each token. However, since tokenizing takes place deep inside the IndexWriter.addDocument call hierarchy, this seems useless. Also, if I preprocess the extracted text and send only the desired words to the secondary index writer, the stored positions and similar data will be wrong. Any ideas about how to do that?
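The position problem can be illustrated with a small plain-Java model. Lucene tokens carry a position increment (the distance from the previous token), so one possible approach is to drop the unaccented words but fold the gaps they leave into the increment of the next emitted token. The sketch below only models that bookkeeping; `Tok` and `accentedOnly` are hypothetical names, not Lucene classes, and "accent" is again assumed to mean Unicode combining marks.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionPreservingFilter {

    // Hypothetical stand-in for a token: a term plus its position increment.
    static final class Tok {
        final String term;
        final int posIncrement;

        Tok(String term, int posIncrement) {
            this.term = term;
            this.posIncrement = posIncrement;
        }
    }

    // Keeps only accented words (in simplified form). Each skipped word adds
    // one to the increment of the next emitted token, so positions in the
    // secondary stream still match positions in the full text.
    static List<Tok> accentedOnly(List<String> words) {
        List<Tok> out = new ArrayList<>();
        int skipped = 0;
        for (String word : words) {
            String simplified = word.replaceAll("\\p{Mn}", "");
            if (simplified.equals(word)) {
                skipped++;          // unaccented: not emitted, but counted
                continue;
            }
            out.add(new Tok(simplified, skipped + 1));
            skipped = 0;
        }
        return out;
    }
}
```

For a five-word input whose second and fifth words are accented, this yields increments 2 and 3, which correspond to absolute positions 1 and 4 (counting from 0), the same positions those words have in the original text.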


Regards,

Zakeri
