java search in documents
2 posts in topic

Posted By:   Ganesh_Venkat
Posted On:   Monday, April 9, 2001 01:35 AM

Hi,

I am currently working on a Java-based application that searches within documents.

If a user specifies a word (e.g. Jsp), the application searches for that word within documents such as .txt, .doc, .html, and .pdf files, starting from a specified root directory.
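
For readers following along, the directory walk described above could be sketched roughly as follows. This is only an illustration, not the poster's code: the class name DocumentFinder and its methods are hypothetical, and the extension list is simply the one mentioned in the post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: collects candidate documents under a root directory.
class DocumentFinder {
    // Extensions named in the post; adjust to taste.
    static final Set<String> EXTENSIONS = Set.of(".txt", ".doc", ".html", ".pdf");

    // True if the file name ends with one of the extensions, ignoring case.
    public static boolean hasSearchableExtension(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return EXTENSIONS.stream().anyMatch(lower::endsWith);
    }

    // Recursively lists regular files under root that have a searchable extension.
    public static List<Path> findDocuments(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> hasSearchableExtension(p.getFileName().toString()))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Root directory comes from the command line, defaulting to the current directory.
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        findDocuments(root).forEach(System.out::println);
    }
}
```

Each path returned would then be handed to whatever per-format search routine is available.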

My application works fine for .txt, .doc, and .html files, but the main problem is with .pdf files.

I have currently implemented this with StreamTokenizer. Since a PDF file is in a different format, the tokenizer is not able to match the word in a .pdf file.
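
A minimal sketch of what a StreamTokenizer-based word match might look like is given below. The poster's actual code is not shown in the thread, so the class WordSearch and the method containsWord are assumptions made for illustration, as is the choice to treat only ASCII letters and digits as word characters.

```java
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;

// Hypothetical helper: whole-word, case-insensitive search over a character stream.
class WordSearch {
    public static boolean containsWord(Reader in, String word) throws IOException {
        StreamTokenizer st = new StreamTokenizer(in);
        st.resetSyntax();              // drop the default number/quote/comment handling
        st.wordChars('a', 'z');
        st.wordChars('A', 'Z');
        st.wordChars('0', '9');
        st.whitespaceChars(0, ' ');    // control characters and space separate tokens
        int type;
        while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
            // Punctuation such as '<' or '>' comes back as single-character tokens and is skipped.
            if (type == StreamTokenizer.TT_WORD && st.sval.equalsIgnoreCase(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java WordSearch <file> <word>
        try (Reader in = new FileReader(args[0])) {
            System.out.println(containsWord(in, args[1]));
        }
    }
}
```

This works on plain text and HTML because the words appear literally in the file. In a PDF, the page text is usually held in compressed streams, so tokenizing the raw file rarely finds it.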

Is there a parser for searching in .pdf files? I am stuck.

I am waiting for your reply, as this is very urgent.

Thanks in advance.


Re: java search in documents

Posted By:   Kiran_Kumar  
Posted On:   Monday, August 6, 2001 05:35 AM

Hi Ganesh, We are faced with a similar requirement.

We stumbled upon IBM Bridge2Java (http://www.alphaworks.ibm.com/tech/bridge2java), which converts .doc/.xls files to text files that can then be searched.

It would be great if you could share your approach to searching .doc files. We are stuck on PDF. Please let us know if you have had any luck searching PDF files.

Also, any additional tips on searching/indexing in Java would be a great help.

Thank you.

Re: java search in documents

Posted By:   Tim_Rohaly  
Posted On:   Wednesday, May 16, 2001 10:22 AM

Since you say you are able to deal with .doc files, you
surely must be aware of what the problem is. Microsoft
Word stores documents in a binary format - simply searching
for a text string in a .doc file will always fail. Instead, you
need to parse the .doc file to extract the textual information,
which can then be searched. The same holds true of
Adobe Portable Document Format files (although PDF is
not a pure binary format). The only way I know of to
parse PDF is to use Adobe's PDF APIs, which are callable
from Java as native code using the JNI.
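
The "extract the text first, then search it" approach described above can be sketched as a small dispatcher keyed on file extension. Everything here is hypothetical scaffolding: the ExtractThenSearch class and the TextExtractor interface are invented for illustration, and only a plain-text extractor is actually implemented. A PDF or Word extractor would wrap whichever parser is available (for example a JNI binding) and is deliberately left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: route each file to a format-specific text extractor, then search the text.
class ExtractThenSearch {
    /** Assumed interface: turns one file of a given format into plain text. */
    public interface TextExtractor {
        String extract(Path file) throws IOException;
    }

    private final Map<String, TextExtractor> extractors = new HashMap<>();

    // Associates an extension such as ".txt" with the extractor that understands it.
    public void register(String extension, TextExtractor extractor) {
        extractors.put(extension.toLowerCase(Locale.ROOT), extractor);
    }

    // Case-insensitive substring search over the extracted text.
    // Returns false when no extractor is registered for the file's extension.
    public boolean contains(Path file, String word) throws IOException {
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        int dot = name.lastIndexOf('.');
        TextExtractor extractor = dot < 0 ? null : extractors.get(name.substring(dot));
        if (extractor == null) {
            return false;
        }
        String text = extractor.extract(file).toLowerCase(Locale.ROOT);
        return text.contains(word.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ExtractThenSearch <file> <word>
        ExtractThenSearch search = new ExtractThenSearch();
        // Plain text needs no parsing; UTF-8 is an assumption about the input files.
        search.register(".txt", f -> new String(Files.readAllBytes(f), StandardCharsets.UTF_8));
        // A ".pdf" or ".doc" extractor would be registered here in the same way.
        System.out.println(search.contains(Paths.get(args[0]), args[1]));
    }
}
```

Keeping the format-specific parsing behind one interface means the search logic stays the same no matter which parser ends up handling PDF.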