How can I index PDF documents?

Otis Gospodnetic

In order to index PDF documents you need to first parse them to extract text that you want to index from them. Here are some PDF parsers that can help you with that:

PDFBox is a Java API from Ben Litchfield that will let you access the contents of a PDF document. It comes with integration classes for Lucene to translate a PDF into a Lucene document.

XPDF is an open source tool that is licensed under the GPL. It's not a Java tool, but there is a utility called pdftotext that can translate PDF files into text files on most platforms from the command line.

Based on xpdf, there is a utility called pdftohtml that can translate PDF files into HTML files. This is also not a Java application.

JPedal is a Java API for extracting text and images from PDF documents.

Simple Text Extractor Library for use with PDF documents. Relies on PDFBox.

0 Comments  (click to add your comment)
Comment and Contribute

 

 

 

 

 


(Maximum characters: 1200). You have 1200 characters left.

 

 

About | Sitemap | Contact