Apache Nutch, Tika, Lucene

Posted By:   Anonymous
Posted On:   Wednesday, February 24, 2010 04:01 AM

Hi, I understand Nutch uses Lucene as the search and index engine internally.



I am curious about the web-crawling feature in Nutch. Does it use Tika internally, i.e. does it crawl the web pages and extract metadata and structured text before sending them to Lucene for indexing?



Can anyone who is well-versed in Nutch give some advice?



It also seems Tika is pretty new (version 0.6), because I cannot even find a Java JAR or a tutorial demo on how to use it. Can anyone well-versed in Tika give some advice?



Thanks.

Re: Apache Nutch, Tika, Lucene

Posted By:   alfredalfie  
Posted On:   Thursday, June 14, 2012 02:41 AM

In case you upgrade to Java 7, remember that you may have to reindex, as the Unicode version shipped with Java 7 changed and tokenization behaves differently (e.g. lowercasing).
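
If you want to see what your own JVM does, here is a minimal sketch using only the JDK (no Lucene or Nutch involved). It probes one code point, U+1E9E (LATIN CAPITAL LETTER SHARP S), which was added to Unicode after version 4.0. The class name `UnicodeCheck` and the helper names are made up for illustration, and the choice of code point is my own assumption about a useful probe, not something Lucene itself checks.

```java
public class UnicodeCheck {
    // Probe character (illustrative choice): U+1E9E LATIN CAPITAL LETTER SHARP S.
    static final int CAPITAL_SHARP_S = 0x1E9E;

    // True if the running JVM's Unicode tables define this code point.
    static boolean knowsCodePoint(int codePoint) {
        return Character.isDefined(codePoint);
    }

    // Lowercase mapping according to the running JVM's Unicode tables.
    // An undefined code point is returned unchanged.
    static int lowercaseOf(int codePoint) {
        return Character.toLowerCase(codePoint);
    }

    public static void main(String[] args) {
        System.out.println("java.version = " + System.getProperty("java.version"));
        System.out.println("U+1E9E defined: " + knowsCodePoint(CAPITAL_SHARP_S));
        System.out.printf("lowercase of U+1E9E: U+%04X%n", lowercaseOf(CAPITAL_SHARP_S));
    }
}
```

A JVM with older Unicode tables should report the code point as undefined and leave it unchanged when lowercasing, while a newer one maps it to U+00DF. Text indexed under one behaviour and queried under the other can fail to match, which is why a reindex may be needed.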

Re: Apache Nutch, Tika, Lucene

Posted By:   Anonymous  
Posted On:   Wednesday, February 24, 2010 11:28 PM

This is just to give some open source alternatives to Tika. They are more stable, and most are hosted at SourceForge:



Aperture


WebHarvest


Heritrix



Maybe the folks at the above projects could be invited to join Lucene?



Thanks.
