Monday, April 18, 2005 08:48 AM
I am not sure whether this is the problem or just something to be aware of. When indexing text, Lucene by default indexes only the first 10,000 tokens of a field. You can change that limit through IndexWriter.maxFieldLength. I do not know whether pdftotext works better, but that is what we use at my company. Your results may also depend on how the PDF was put together. If your PDFs are made up of images with OCR text overlaid on top, you may get some interesting results. The word "Lucene" may have been OCRed as "Luc ene", so a search for Lucene would get no hits. The two things I would look at are the 10,000-token limit and the text actually pulled from your PDFs: check whether the text you are searching for appears in it.
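Both of those checks can be run on the extracted text before it ever reaches the index. What follows is only a rough sketch in plain Java, not anything from Lucene or PDFBox: the class ExtractedTextCheck and its methods are hypothetical names, and splitting on non-alphanumeric characters is an assumption that only approximates what your analyzer does, so treat the token positions as estimates.

```java
import java.util.Locale;

// Hypothetical diagnostic helper; not part of Lucene, PDFBox, or pdftotext.
class ExtractedTextCheck {

    // The default limit described above (IndexWriter.maxFieldLength).
    static final int DEFAULT_MAX_FIELD_LENGTH = 10_000;

    // Zero-based position of the first token equal to term (ignoring case),
    // or -1 if no token matches. Tokens are runs of letters and digits,
    // which is only an approximation of a real analyzer.
    static int firstTokenIndex(String text, String term) {
        int index = 0;
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) {
                continue; // leading separator produces one empty string
            }
            if (token.equalsIgnoreCase(term)) {
                return index;
            }
            index++;
        }
        return -1;
    }

    // True if a token at this zero-based position falls outside the first
    // `limit` tokens and would therefore not be indexed.
    static boolean beyondLimit(int tokenIndex, int limit) {
        return tokenIndex >= limit;
    }

    // True if the term never appears as a whole token but does appear once
    // all whitespace is removed. This is a hint (not proof) that OCR split
    // the word, as in "Luc ene".
    static boolean looksSplitByOcr(String text, String term) {
        if (firstTokenIndex(text, term) >= 0) {
            return false;
        }
        String squeezed = text.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
        return squeezed.contains(term.toLowerCase(Locale.ROOT));
    }

    // One-line summary for a term in a piece of extracted text.
    static String describe(String text, String term) {
        int index = firstTokenIndex(text, term);
        if (index >= 0) {
            return beyondLimit(index, DEFAULT_MAX_FIELD_LENGTH)
                    ? "found at token " + index + ", beyond the default limit"
                    : "found at token " + index + ", within the default limit";
        }
        return looksSplitByOcr(text, term)
                ? "not found as a word, but present with whitespace removed (OCR split?)"
                : "not found in the extracted text";
    }

    public static void main(String[] args) {
        // Stand-in for text dumped from a PDF by pdftotext or PDFBox.
        String extracted = "Searching with Luc ene can be surprising.";
        System.out.println(describe(extracted, "Lucene"));
    }
}
```

If the term turns up past the limit, raising IndexWriter.maxFieldLength is the thing to try. If it only turns up with the whitespace removed, the problem is in the extracted text itself and no index setting will help.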
From my experience, I prefer pdftotext; it handles badly created PDFs a bit better. This may be fixed in the latest PDFBox 7.x, but I have never gone back to check.