PDFBox problem
2 posts in topic

Posted By:   Alfred_Sniff
Posted On:   Monday, April 18, 2005 02:23 AM


I am using PDFBox 7.1 to index my PDF documents. I do it like this:

doc = LucenePDFDocument.getDocument(savedFile);

At first I thought it worked, but many words that are in the document are not found when I search. It works for some words, but not for all of them.

I have heard about pdftotext, but I don't know whether it works better. If the results are better, what code do I need to index a PDF with pdftotext?


Re: PDFBox problem

Posted By:   Otis_Gospodnetic  
Posted On:   Monday, April 18, 2005 08:06 PM

Alfred, you should use Luke to look at your index and see what got indexed. You could also grab the code from Lucene in Action, which includes code for indexing PDFs, Word, RTF, XML, HTML...

Re: PDFBox problem

Posted By:   Richard_Krenek  
Posted On:   Monday, April 18, 2005 08:48 AM

I am not sure whether this is the problem or just something to be aware of: by default, Lucene indexes only the first 10,000 tokens of a field. You can change that limit through IndexWriter.maxFieldLength. I do not know whether pdftotext works better, but it is what we use at my company. Your results may also depend on how the PDF was put together. If your PDFs are made up of images with OCR text overlaid on top, you may get some interesting results. The word "Lucene" may have been OCRed as "Luc ene", in which case a search for Lucene would get no hits. The two things I would look at are the 10,000-token limit and the text actually pulled from your PDFs, to see whether the words you are looking for are in it.
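The two checks suggested above can be tried on the extracted text before it ever reaches the index. The following is only a rough sketch using the Java standard library: the class and method names are made up for illustration, and the tokenizer is a simple stand-in (lowercase, split on anything that is not a letter or digit), not the analyzer Lucene actually uses, so token positions are approximate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for inspecting text extracted from a PDF.
// The tokenizer is a simplification, not Lucene's own analyzer.
class ExtractedTextCheck {
    // Default number of tokens Lucene indexes per field, as mentioned in the post.
    static final int DEFAULT_TOKEN_LIMIT = 10_000;

    // Lowercase the text and split it on runs of non-letter, non-digit characters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Zero-based position of the first token equal to the term, or -1 if absent.
    public static int firstTokenPosition(String text, String term) {
        return tokenize(text).indexOf(term.toLowerCase(Locale.ROOT));
    }

    // True if the term first appears at or after the default token limit,
    // which means a default-configured index would not contain it.
    public static boolean beyondDefaultLimit(String text, String term) {
        return firstTokenPosition(text, term) >= DEFAULT_TOKEN_LIMIT;
    }

    // Heuristic for OCR splits such as "Luc ene": the term is not a token,
    // but it shows up once all whitespace is removed. This can give false
    // positives when the letters happen to span two unrelated words.
    public static boolean appearsOnlyWhenSpacesIgnored(String text, String term) {
        String needle = term.toLowerCase(Locale.ROOT);
        if (tokenize(text).contains(needle)) {
            return false;
        }
        String squashed = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", "");
        return squashed.contains(needle);
    }

    public static void main(String[] args) {
        String sample = "Indexing PDFs with Luc ene can be surprising.";
        System.out.println("position of 'lucene': " + firstTokenPosition(sample, "Lucene"));
        System.out.println("split by OCR? " + appearsOnlyWhenSpacesIgnored(sample, "Lucene"));
    }
}
```

If a word you expect to find comes back as beyond the limit, raising IndexWriter.maxFieldLength is the thing to try; if it only appears when spaces are ignored, the extraction or the OCR layer is the more likely culprit.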

From my experience, I prefer pdftotext; it handles badly created PDFs a bit better. This may be fixed in the latest PDFBox 7.x, but I have never gone back to check.
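For anyone wondering what calling pdftotext from Java might look like, here is a minimal sketch, not the code we actually run. It assumes Java 11 or later and that the pdftotext command-line tool is installed and on the PATH; the class and method names are hypothetical. It asks pdftotext for UTF-8 output and uses "-" as the output file so the text comes back on standard output.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.List;

// Hypothetical wrapper around the pdftotext command-line tool.
// Assumes pdftotext is installed and reachable through the PATH.
class PdfToTextRunner {
    // Build the command line: UTF-8 output, "-" sends the text to stdout.
    public static List<String> buildCommand(Path pdf) {
        return List.of("pdftotext", "-enc", "UTF-8", pdf.toString(), "-");
    }

    // Run pdftotext on one file and return everything it printed.
    public static String extractText(Path pdf) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf))
                .redirectError(ProcessBuilder.Redirect.DISCARD) // ignore warnings on stderr
                .start();
        String text;
        try (InputStream out = process.getInputStream()) {
            text = new String(out.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("pdftotext exited with code " + exitCode);
        }
        return text;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java PdfToTextRunner <file.pdf>");
            return;
        }
        System.out.println(extractText(Path.of(args[0])));
    }
}
```

The returned string can then be added to the Lucene document as an ordinary tokenized text field in place of what LucenePDFDocument produces, and it is also easy to print or save so you can read exactly what was extracted.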