Indexing Encrypted PDF Documents

Posted By:   Alfred_Sniff
Posted On:   Tuesday, April 26, 2005 03:06 AM

Hi again,

I use PDFBox 7.0.1 to index my PDF documents. Today I ran into problems indexing some of them. I don't know whether the cause is encryption, date, or document size. The failures always happen with PDFs larger than 2 MB, but I think that is a coincidence; I suspect the real problem is that the PDF is encrypted. So I tried the PDFBoxPDFHandler class from the LuceneInAction source code, but either it doesn't work or I am not using it correctly. Does anybody know this problem and how to fix it? Here is the URL of a document that did not work for me:

http://tel.ccsd.cnrs.fr/documents/archives0/00/00/17/67/tel-00001767-00/tel-00001767.pdf

Moreover, there is a problem when a document contains characters such as Chinese or Russian. Is there an analyzer that handles characters from every language?

Thanks for any answers
Best Regards

Re: Indexing Encrypted PDF Documents

Posted By:   Richard_Krenek  
Posted On:   Tuesday, April 26, 2005 06:08 AM

I cannot answer all your questions, but yes, the PDF you pointed to is secured so that text cannot be copied from it. Adobe's encryption is very easy to get around, but a better solution is to work with the people who provided the PDF and either get the passwords for the PDFs or have them give you unsecured PDFs. If the PDFs are secured and the provider gives you neither unsecured PDFs nor the password, you can assume they really do not want you to pull text from their PDFs.
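
To find out which files in a collection are secured before handing them to the indexer, one option is a rough pre-check on the raw bytes. The sketch below is only a heuristic and does not use PDFBox at all: it relies on the fact that an encrypted PDF carries an /Encrypt entry in its trailer (or cross-reference stream) dictionary. The class and method names are hypothetical, and a false positive is possible if the literal text "/Encrypt" happens to appear in uncompressed page content.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper for illustration; not part of PDFBox or Lucene. */
final class PdfEncryptionSniffer {

    private static final String KEY = "/Encrypt";

    /** Returns true if the raw PDF bytes contain an /Encrypt dictionary key. */
    static boolean looksEncrypted(byte[] pdfBytes) {
        // ISO-8859-1 maps every byte to exactly one char, so binary
        // stream data cannot cause decoding errors.
        String raw = new String(pdfBytes, StandardCharsets.ISO_8859_1);
        int from = 0;
        while (true) {
            int at = raw.indexOf(KEY, from);
            if (at < 0) {
                return false;
            }
            int end = at + KEY.length();
            // Skip longer names such as /EncryptMetadata, which are not
            // the trailer entry we are looking for.
            if (end >= raw.length() || !Character.isLetterOrDigit(raw.charAt(end))) {
                return true;
            }
            from = end;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfEncryptionSniffer <file.pdf>");
            return;
        }
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(looksEncrypted(bytes) ? "probably encrypted" : "no /Encrypt entry found");
    }
}
```

A file flagged this way still needs a password or an unsecured copy from the provider; the check only tells you which documents to ask about.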


As for determining which fonts are in a PDF, I do not know off the top of my head how to do that; you may want to post to a PDFBox forum about this issue. Nor do I know whether there is a general analyzer that can help you. You may need to write some specific code to do that.
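
As a starting point for that kind of specific code, here is a minimal sketch of a script-neutral tokenizer written with the Java standard library only. It is not a Lucene Analyzer and the class name is hypothetical. It assumes a simple policy: runs of letters or digits in any script (Latin, Cyrillic, and so on) become one lowercased token, while each CJK ideograph becomes its own token, since Chinese text has no spaces between words.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical tokenizer for illustration; not part of Lucene. */
final class UnicodeTokenizer {

    /** Splits text into lowercased word tokens and single-ideograph tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        // Walk by code point so characters outside the BMP are not split.
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isIdeographic(cp)) {
                // CJK ideograph: close any pending word, emit the character alone.
                flush(word, tokens);
                tokens.add(new String(Character.toChars(cp)));
            } else if (Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(Character.toLowerCase(cp));
            } else {
                // Whitespace or punctuation ends the current word.
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Tokenizes the command-line arguments joined by spaces.
        System.out.println(tokenize(String.join(" ", args)));
    }
}
```

The same policy would have to be applied to both the indexed text and the query text for searches to match. Single-ideograph tokens are the simplest choice for Chinese; overlapping two-character tokens are a common alternative.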