Breaking up search and index by a period "."

Posted By:   David_Parish
Posted On:   Thursday, February 3, 2005 07:12 AM


I index a large subset of technical support issues. Many of those issues contain exceptions. A simple example is this:


java.lang.OutOfMemoryError


The problem is that someone may search for just "OutOfMemoryError". What I found is that this will not return a document whose contents are actually "java.lang.OutOfMemoryError". At first I thought I could just make all queries wildcard queries, but I came across this in the javadocs for WildcardQuery:

In order to prevent extremely slow WildcardQueries, a Wildcard term must not start with one of the wildcards * or ?.

Since my wildcard query would have to be "*OutOfMemoryError", that restriction rules this approach out.


I assume the solution is to write a tokenizer that splits on the period ".". I thought the StandardAnalyzer already did this. Is there a simple solution to this problem? What would I need to change in the Analyzer to break the terms up at each period?


Thanks,

-Dave


Re: Breaking up search and index by a period "."

Posted By:   Otis_Gospodnetic  
Posted On:   Friday, February 4, 2005 08:53 AM

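Look at the source of Lucene's WhitespaceTokenizer.java. You would need to create your own class, say DotTokenizer, with its own isTokenChar method that returns false for '.' (and for any other character you want to split on), so that a period ends the current token instead of becoming part of it.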
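
To make the suggestion concrete, here is a minimal standalone sketch of the splitting logic. It is plain Java with no Lucene dependency, so it only imitates what a tokenizer built around isTokenChar would do. The class name DotTokenizerSketch and the tokenize helper are hypothetical, and treating whitespace as a separator alongside '.' is an assumption. In Lucene itself, the same predicate would presumably go into a subclass of CharTokenizer, the class that WhitespaceTokenizer extends.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone illustration (not Lucene code) of splitting text on '.' and whitespace. */
public class DotTokenizerSketch {

    /**
     * Hypothetical counterpart of isTokenChar: returns true for characters
     * that belong inside a token, false for separators.
     * Assumption: both '.' and whitespace act as separators.
     */
    public static boolean isTokenChar(char c) {
        return c != '.' && !Character.isWhitespace(c);
    }

    /** Collects maximal runs of token characters, dropping the separators. */
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isTokenChar(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A separator closes the token collected so far.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [java, lang, OutOfMemoryError, thrown]
        System.out.println(tokenize("java.lang.OutOfMemoryError thrown"));
    }
}
```

With tokens split this way, "OutOfMemoryError" becomes a term of its own, so a plain term query can match it without a leading wildcard. The same splitting would need to be applied both when indexing and when parsing queries, otherwise the indexed terms and the query terms will not line up.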