jGuru
Question Can Lucene index PDF documents?
Topics Tools:Search:Lucene
Author Otis Gospodnetic PREMIUM
Created May 1, 2002


Answer
Lucene can index anything that can be converted to a String and fed to it through its API. See the Lucene Contributions page for some pointers.


Comments and alternative answers

i2a websearch
Eelco Cramer PREMIUM, May 29, 2002  [replies:4]
i2a websearch is a search engine based on Lucene. It claims to be able to search in PDFs.

I haven't tested it, but it might be helpful.

http://www.i2a.com/websearch/


Re: i2a websearch
Richard Burton, Dec 14, 2002  [replies:1]
I would use Runtime.getRuntime().exec("pdf2text") and capture the output stream of the forked process. The only issue is that pdf2text sometimes extracts binary data from the PDF as well. But it extracts all of the text, which is what you want (along with a little bit of trash). I hope this helps.
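
A minimal sketch of that approach, for illustration only: it runs an external converter, captures its standard output as a String, and strips stray control characters (the "trash" mentioned above) before the text would be handed to Lucene. The class and method names are hypothetical, and the converter command is an assumption; with the xpdf/poppler tool the call would typically be `pdftotext file.pdf -`, where the trailing `-` sends the text to standard output. It uses ProcessBuilder rather than Runtime.exec, and needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

// Hypothetical helper class; not part of Lucene or of any PDF tool.
final class ExternalPdfText {

    /** Runs an external command and returns everything it wrote to standard output. */
    static String runAndCapture(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Discard stderr so the child cannot block on a full error pipe.
        builder.redirectError(ProcessBuilder.Redirect.DISCARD);
        Process process = builder.start();

        String output;
        try (InputStream stdout = process.getInputStream()) {
            // Read stdout to the end before waiting, so the child never stalls on a full pipe.
            // Assumes the tool emits UTF-8; adjust the charset if it does not.
            output = new String(stdout.readAllBytes(), StandardCharsets.UTF_8);
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("command exited with status " + exitCode);
        }
        return output;
    }

    /** Drops control characters other than newline, carriage return, and tab. */
    static String stripControlChars(String text) {
        StringBuilder cleaned = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean keep = c == '\n' || c == '\r' || c == '\t' || !Character.isISOControl(c);
            if (keep) {
                cleaned.append(c);
            }
        }
        return cleaned.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: ExternalPdfText <command> [args...]");
            return;
        }
        // The whole command line is passed through, e.g.: pdftotext report.pdf -
        String raw = runAndCapture(List.of(args));
        System.out.println(stripControlChars(raw));
    }
}
```

The cleaned String is what you would then add to a Lucene Document as the contents field. Passing the command as a list of arguments avoids shell quoting problems with file names that contain spaces.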

Re[2]: i2a websearch
Weldon Sams, Jul 14, 2004
You mentioned pdf2text. Could you explain where the Runtime call to pdf2text comes from, or what steps I should take to index a bunch of PDFs with Lucene? I'm very new to Lucene, so I'm not up to speed with everything yet. Also, do you know of any good help documents for Lucene?
Thanks


Re: i2a websearch
Matthieu Casanova, Feb 12, 2004  [replies:1]
You can also use PDFBox, a Java API for PDF. It includes a class that parses a PDF and returns a Lucene Document. It works great and it's free: http://www.pdfbox.org/

Re[2]: i2a websearch
Kalani Ruwanpathirana, Aug 26, 2008
Yes, PDFBox works very well with Lucene; I have worked with it. This post shows how to do it: http://kalanir.blogspot.com/2008/08/indexing-pdf-documents-with-lucene.html

Try PDFTextStream
Chas Emerick, Sep 7, 2004
PDFTextStream goes one step further than just extracting text from PDF files for use with Lucene: it provides a complete set of integration classes that enable a Lucene user to easily add PDF document content to Lucene indexes. There's a full tutorial and sample code available: PDFTextStream / Lucene Integration

Can Lucene index PDF documents?
Dharmanand Singh, Nov 2, 2004  [replies:1]
Yes, it can definitely be done. Many libraries support this. One library that also has support for Lucene is: http://www.pdfbox.org/
You can also download an example that parses and searches PDF documents: http://dharmanand.tarundua.net/lucene_eg.war
You can find some details at: http://dharmanand.tarundua.net/


Re: Can Lucene index PDF documents?
durot durman, Nov 25, 2004
View http://www.jguru.com/faq/view.jsp?EID=1074237
