Convert different types of documents to XML

Posted By:   koula_soula
Posted On:   Wednesday, December 29, 2004 12:39 PM


I am desperately looking for a tool that converts different types of documents (PDF, PS, DOC, ...) to XML.

Does anybody know of such a magic tool?



Re: Convert different types of documents to XML

Posted By:   Christopher_Koenigsberg  
Posted On:   Thursday, December 30, 2004 07:18 AM

What do you mean by "xml" in this case? You could just make an XML document with a root node of "mydoc" and stick the entire (pdf, ps, doc) file inside it. That is technically XML, but presumably there is some specific structure you are interested in?
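To show how little that trivial wrapping buys you, here is a minimal Python sketch. It assumes the binary file is base64-encoded so that it is legal XML character data; the function names and the "filename" and "encoding" attributes are made up for illustration, only the "mydoc" root comes from the suggestion above.

```python
import base64
import xml.etree.ElementTree as ET
from pathlib import Path


def wrap_bytes_in_xml(data: bytes, filename: str) -> str:
    """Put raw file bytes inside a single <mydoc> root element."""
    # Attribute names are hypothetical, chosen only for this example.
    root = ET.Element("mydoc", {"filename": filename, "encoding": "base64"})
    # Binary data is not valid XML text, so encode it as base64 first.
    root.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def wrap_file_in_xml(path: str) -> str:
    """Read a file from disk and wrap its contents as above."""
    p = Path(path)
    return wrap_bytes_in_xml(p.read_bytes(), p.name)
```

The result is well-formed XML, but it carries none of the document's structure, which is why the real question is what structure you want out.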

If you have to parse some structure out of your source documents, each document type obviously has its own format, so each needs its own parser. You also have to decide what kind of structure you are interested in (paragraphs, headings, tables, metadata, etc.).
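As one example of such a decision, suppose all you care about is paragraphs. A rough Python sketch, assuming the text has already been extracted and that paragraphs are separated by blank lines (the "para" element name and the function name are hypothetical):

```python
import re
import xml.etree.ElementTree as ET


def text_to_paragraph_xml(text: str) -> str:
    """Turn plain text into <mydoc><para>...</para>...</mydoc>."""
    root = ET.Element("mydoc")
    # Treat one or more blank lines as a paragraph boundary.
    for block in re.split(r"\n\s*\n", text):
        # Join wrapped lines and collapse runs of whitespace.
        para = " ".join(block.split())
        if para:
            ET.SubElement(root, "para").text = para
    # ElementTree escapes &, < and > in the text for us.
    return ET.tostring(root, encoding="unicode")
```

Anything richer than this (headings, tables, fonts) depends on what the source format actually records, and on how the extraction step preserves it.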

There are already command-line Unix tools for things like extracting text from ps, pdf, doc, and html (most or all are free software, so I think they already come with most Linux distributions). But the quality of the result varies wildly, depending on how the information is encoded in the ps, pdf, or html.
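A sketch of driving such tools from Python, picking one by file extension. The particular tools named here (pdftotext, ps2ascii, antiword) are my assumption of typical choices and may not be installed on your system; the function names and the mapping are made up for illustration, and html is left out.

```python
import subprocess
from pathlib import Path
from typing import List

# Assumed tool per extension: (program, arguments placed after the input file).
# "pdftotext FILE -" writes the text to standard output; ps2ascii and
# antiword write to standard output when given only the input file.
EXTRACTORS = {
    ".pdf": ("pdftotext", ["-"]),
    ".ps": ("ps2ascii", []),
    ".doc": ("antiword", []),
}


def build_command(path: str) -> List[str]:
    """Return the command line for the extractor matching the file type."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor configured for {suffix!r}")
    program, trailing = EXTRACTORS[suffix]
    return [program, path, *trailing]


def extract_text(path: str) -> str:
    """Run the external tool and return whatever text it prints."""
    result = subprocess.run(build_command(path), capture_output=True, check=True)
    # Output encoding varies by tool and locale; replace undecodable bytes.
    return result.stdout.decode("utf-8", errors="replace")
```

Expect to clean up the output (page breaks, hyphenation, columns read out of order) before wrapping it in any XML structure.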
