Using Lucene to get term-document matrix

TextAnalysis
Wed Aug 15, 2012 06:44 PM

Hi all,

I'm trying to do a bit of text analysis with Lucene. At the moment, I'm successfully outputting the term-document matrix for the indexed corpus. However, I'm having trouble implementing one specific feature.

The program returns the term-document matrix, where the terms are single words (stemmed with the Porter stemmer). I need to add the ability to include important, user-specified bigrams and trigrams in the term-document matrix. In short, we'd like the output to be a term-document matrix whose terms are single words (ideally stemmed) plus user-specified phrases of more than one word.
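To make the desired output concrete, here is a minimal sketch in plain Python, independent of Lucene. It is not PyLucene code: the function names, the regex tokenizer, and the `stem` hook (identity by default, standing in for a Porter stemmer) are all assumptions for illustration. Phrases are matched on the unstemmed tokens, and the words inside a matched phrase are still counted as single terms too.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on anything that is not a letter."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_matrix(docs, phrases, stem=lambda w: w):
    """Return (terms, matrix), where matrix[i][j] counts terms[i] in docs[j].

    docs:    list of document strings.
    phrases: user-specified bigrams/trigrams as space-separated strings.
    stem:    hypothetical hook for a stemmer; identity by default.
    """
    phrase_tuples = {tuple(tokenize(p)) for p in phrases}
    lengths = sorted({len(p) for p in phrase_tuples})

    counts = []
    for text in docs:
        tokens = tokenize(text)
        # Single-word terms, passed through the stemmer hook.
        c = Counter(stem(t) for t in tokens)
        # Slide a window of each requested length and keep only
        # the n-grams the user asked for.
        for n in lengths:
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                if gram in phrase_tuples:
                    c[" ".join(gram)] += 1
        counts.append(c)

    terms = sorted(set().union(*counts))
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix


if __name__ == "__main__":
    docs = ["New York is big", "I love new york new york"]
    terms, matrix = term_document_matrix(docs, ["new york"])
    for term, row in zip(terms, matrix):
        print(term, row)
```

On the Lucene side, the `ShingleFilter` token filter emits word n-grams during analysis, so one possible route is to generate shingles and keep only the ones on the user's list. Whether that fits this particular indexing setup is untested here.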

I've spent a lot of time trying to figure this out. I'm using PyLucene and unfortunately don't know a lot of Java. As such, the underlying Java code is a tedious read that I've avoided as much as possible.

Is there an easy way to do this? Should I include some code samples?

Thanks in advance.


