Posted By:
Anonymous
Posted On:
Wednesday, February 24, 2010 04:01 AM
Hi, I understand Nutch uses Lucene internally as its search and indexing engine.
I am curious about the web-crawling feature in Nutch. Does it use Tika internally, i.e. when it crawls web pages, to extract metadata and structured text before sending them to Lucene for indexing?
Can anyone who is well-versed in Nutch give some advice?
It also seems Tika is pretty new (version 0.6), because I cannot even find a Java JAR or a tutorial demo on how to use it. Can anyone who is well-versed in Tika give some advice?
Thanks.