Using Lucene to get term-document matrix

Posted By:   TextAnalysis
Posted On:   Wednesday, August 15, 2012 06:44 PM

Hi all,

I'm trying to do a bit of text analysis with Lucene. At the moment, I'm successfully outputting the term-document matrix for the indexed corpus. However, I'm having trouble implementing one specific feature.

The program returns the term-document matrix, where the terms are single words (stemmed with the Porter stemmer). I need to add the ability to include important, user-specified bigrams and trigrams in the term-document matrix. In short, we'd like the output to be a term-document matrix whose terms are single words (ideally stemmed) plus user-specified phrases of more than one word.
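
For concreteness, here is a minimal plain-Python sketch of the desired output. It does not use Lucene or PyLucene. The names `tokenize`, `document_terms`, `term_document_matrix`, and the `stem` parameter are hypothetical, and stemming defaults to a no-op where a Porter stemmer would be plugged in. The sketch assumes phrases are matched on the unstemmed tokens and counted in addition to their component words.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase, then treat each run of letters/digits as one token.
    return re.findall(r"[a-z0-9]+", text.lower())


def document_terms(text, phrases, stem=lambda w: w):
    # phrases: set of word tuples, e.g. {("machine", "learning")}.
    # stem: hypothetical hook; the default leaves words unchanged.
    tokens = tokenize(text)
    terms = [stem(t) for t in tokens]
    # Add one extra term for every occurrence of a user-specified phrase.
    for n in sorted({len(p) for p in phrases}):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if gram in phrases:
                terms.append(" ".join(gram))
    return terms


def term_document_matrix(docs, phrases, stem=lambda w: w):
    # One Counter per document, then one matrix row per vocabulary term.
    counts = [Counter(document_terms(d, phrases, stem)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix


if __name__ == "__main__":
    docs = ["Machine learning is fun", "Learning machine code"]
    vocab, matrix = term_document_matrix(docs, {("machine", "learning")})
    for term, row in zip(vocab, matrix):
        print(term, row)
```

Rows are terms and columns are documents, in input order. If a phrase should replace its component words instead of being counted alongside them, the matching loop would need to consume the matched tokens.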

I've spent a lot of time trying to figure this out. I'm using PyLucene and unfortunately don't know a lot of Java. As such, the underlying Java code is a tedious read that I've avoided as much as possible.

Is there an easy way to do this? Should I include some code samples?

Thanks in advance.

Re: Using Lucene to get term-document matrix

Posted By:   praveenkumar2011  
Posted On:   Wednesday, August 22, 2012 07:02 AM

Hi all,

I'm trying to do a bit of text analysis with Lucene. At the moment, I'm successfully outputting the term-document matrix for the indexed corpus. However, I'm having trouble implementing one specific utility.

