Re: convert different types of documents to XML
Thursday, December 30, 2004 07:18 AM
What do you mean by "xml" in this case? You could just make an XML document, with a root node of "mydoc", and stick the entire (pdf, ps, doc) file inside it. Presumably there is some specific structure you are interested in?
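To make the "stick the entire file inside it" option concrete, here is a minimal sketch in Python. The root node name "mydoc" is from the paragraph above; the helper names, the "filename" and "encoding" attributes, and the choice of base64 are my own assumptions. Some encoding is needed because raw pdf/ps/doc bytes are not legal XML character data.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_in_xml(data: bytes, filename: str) -> str:
    """Wrap arbitrary file bytes in a single <mydoc> root element.

    The attribute names are illustrative assumptions, not a standard.
    """
    root = ET.Element("mydoc", {"filename": filename, "encoding": "base64"})
    # Binary content cannot go into XML as-is, so store it as base64 text.
    root.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="unicode")


def unwrap_from_xml(xml_text: str) -> bytes:
    """Recover the original bytes from a document made by wrap_in_xml."""
    root = ET.fromstring(xml_text)
    return base64.b64decode(root.text or "")
```

Calling wrap_in_xml(open("report.pdf", "rb").read(), "report.pdf") would give you well-formed XML, but it tells you nothing about what is in the document, which is why the question of structure matters.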
If you need to parse structure out of your source documents, each document type obviously has its own format, and you also have to decide what kind of structure you want to capture.
There are already command-line Unix tools for extracting text from ps, pdf, doc, and html (most or all are GNU, I think, so they already come with most Linux distributions). But the quality of the result varies wildly, depending on how the information is encoded in the ps, pdf, or html.
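As a rough sketch of how you might drive such tools from a script: the post does not name specific programs, so the ones below (pdftotext, ps2ascii, antiword) are my assumption of commonly installed extractors, and the function names are hypothetical. HTML is left out here. Each command is expected to write plain text to standard output.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

PATH = "<path>"  # placeholder replaced by the real file name

# Assumed extractor per file extension; adjust to whatever is installed.
EXTRACTORS = {
    ".pdf": ["pdftotext", PATH, "-"],  # "-" sends the text to stdout
    ".ps": ["ps2ascii", PATH],
    ".doc": ["antiword", PATH],
}


def extraction_command(path: str) -> List[str]:
    """Build the argument list for the extractor matching the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTRACTORS:
        raise ValueError(f"no extractor configured for {suffix!r}")
    return [path if arg == PATH else arg for arg in EXTRACTORS[suffix]]


def extract_text(path: str) -> str:
    """Run the extractor and return whatever text it produced."""
    command = extraction_command(path)
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} is not installed")
    result = subprocess.run(command, capture_output=True, check=True)
    # Output encoding varies by tool and document, so decode leniently.
    return result.stdout.decode("utf-8", errors="replace")
```

The text you get back is flat, so any structure you want in the XML still has to be recovered from it afterwards.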