lucene and stopwords

Posted By:   Anonymous
Posted On:   Friday, June 3, 2005 09:11 AM



I am doing NLP work for which I need to keep the stopwords of a sentence rather than have them removed.

I downloaded the Lucene source code and, in class org.apache.lucene.analysis.StopAnalyzer, replaced



public static final String[] ENGLISH_STOP_WORDS = {
"a", "an", "and", "are", "as", "at", "be", "but", "by",
"for", "if", "in", "into", "is", "it",
"no", "not", "of", "on", "or", "s", "such",
"t", "that", "the", "their", "then", "there", "these",
"they", "this", "to", "was", "will", "with"
};


with


public static final String[] ENGLISH_STOP_WORDS = {};



Then I built the new jar, put it on the classpath
in place of the original, and ran my software.
Unfortunately, stopwords are still being removed.


My suspicion is that there is a conflict between .jar files.

On the same classpath I also have log4j-1.2.6.jar,
which provides the class org.apache.log4j.Logger. That jar is
there because other classes use it
(the software I am using for my experiments was originally written by someone else).



Does anyone know whether
org.apache.log4j.Logger deals in some way with
stopwords, and whether it could conflict with the changes I made to the Lucene source code?

Thank you.

Grazia
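
One way to test the jar-conflict suspicion above is to ask the JVM where it actually loaded a class from. The following is only a minimal sketch using the standard library; the class name JarLocator and its method locationOf are made up for illustration and are not part of Lucene or log4j.

```java
import java.security.CodeSource;

// Hypothetical helper (not part of Lucene or log4j): reports where a class was loaded from.
class JarLocator {

    // Returns the jar or directory URL the class came from, or a note if none is recorded.
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source; likely a JDK bootstrap class)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass a fully qualified class name; defaults to a JDK class so the sketch runs as-is.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " <- " + locationOf(Class.forName(name)));
    }
}
```

Run with the same classpath as the application and with org.apache.lucene.analysis.StopAnalyzer as the argument, this should print the jar the class was loaded from. If that is not the rebuilt jar, an older Lucene jar earlier on the classpath is probably shadowing it.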



Re: lucene and stopwords

Posted By:   Richard_Krenek  
Posted On:   Friday, June 3, 2005 10:55 AM

I do not think you needed to modify the StopAnalyzer. When you construct the StopAnalyzer or StandardAnalyzer, you can pass it your own zero-length array of stop words.
I think this list replaces the default list rather than being added to it. You can look at the source code to verify this.
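
To make the "replaces rather than adds to" behaviour concrete, here is a small plain-Java sketch. It is not Lucene's implementation: StopWordSketch, filter, and the shortened DEFAULT_STOP_WORDS list are all made up for illustration, and the whitespace tokenizer is a simplification. It only shows the semantics described above, where a caller-supplied list is used instead of the default, so an empty list removes nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a toy stop-word filter, not Lucene code.
class StopWordSketch {

    // Shortened, hypothetical default list for the example.
    static final String[] DEFAULT_STOP_WORDS = {"a", "an", "the", "of"};

    // Lower-cases the text, splits on whitespace, and drops tokens found in stopWords.
    // The supplied array is the only list consulted; defaults are not merged in.
    static List<String> filter(String text, String[] stopWords) {
        Set<String> stops = new HashSet<String>(Arrays.asList(stopWords));
        List<String> kept = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty() && !stops.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String sentence = "The cat sat on a mat";
        System.out.println(filter(sentence, DEFAULT_STOP_WORDS)); // [cat, sat, on, mat]
        System.out.println(filter(sentence, new String[0]));      // [the, cat, sat, on, a, mat]
    }
}
```

If the analyzers behave as described in the reply, passing a zero-length array to the constructor should have the same effect as the second call: every token is kept.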