Wednesday, October 16, 2002 12:19 PM
I had to write a PDF extractor from scratch using javax.mail.internet.MimeUtility. I can't disclose details of the code (since it's a corporate project and all) but I can give you some detailed pseudocode that should point you in the right direction.
- Dump the message input stream to a temporary file. The final version of your program will end up deleting this file immediately, but it will be useful to save each dump while you're still debugging.
- Read through the dump file until you find a line beginning with "Content-Type:". Tokenize this line to find the boundary string which delimits MIME message body parts. Store that boundary string locally.
- Continue to trace through the file, checking for another line beginning with "Content-Type". Stop searching when you find one of these lines which also includes the token "
- Use the "NAME=i" directive to determine the final filename of the attachment.
- Between the MIME content header and the next boundary string, write every line from the dump file into another temporary file. When you're done, you can delete the message dump file.
- Construct a decoded input stream from the new file using MimeUtility.decode(, "base64"). Write this input stream bytewise into an output file. This new output file is the actual PDF document. You can delete all the temporary files at this point.
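To make the steps above a bit more concrete, here is a rough Java sketch. It is not the corporate code, just an illustration under some loud assumptions: it works on the message lines already read into memory instead of going through temporary files; it assumes the token to look for in the second Content-Type line is "application/pdf"; it assumes header parameters sit on the same line as the header (no folded headers); and it uses the JDK's java.util.Base64 MIME decoder as a stand-in for MimeUtility.decode so the example has no JavaMail dependency. The names PdfPartSketch, Attachment, headerParam and extract are made up for this example.

```java
import java.util.Base64;
import java.util.List;
import java.util.Locale;

final class PdfPartSketch {

    // Assumption: the body part we want is identified by this content type token.
    static final String PDF_TOKEN = "application/pdf";

    /** Hypothetical holder for the result: attachment name plus decoded bytes. */
    static final class Attachment {
        final String fileName;
        final byte[] bytes;

        Attachment(String fileName, byte[] bytes) {
            this.fileName = fileName;
            this.bytes = bytes;
        }
    }

    static boolean startsWithIgnoreCase(String line, String prefix) {
        return line.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    /** Returns the value of a header parameter such as boundary= or name=, or null. */
    static String headerParam(String headerLine, String param) {
        String lower = headerLine.toLowerCase(Locale.ROOT);
        int at = lower.indexOf(param.toLowerCase(Locale.ROOT) + "=");
        if (at < 0) {
            return null;
        }
        String value = headerLine.substring(at + param.length() + 1).trim();
        if (value.startsWith("\"")) {
            // Quoted value: take everything up to the closing quote.
            int close = value.indexOf('"', 1);
            return close > 0 ? value.substring(1, close) : value.substring(1);
        }
        // Unquoted value: ends at the next ';' or at the end of the line.
        int semi = value.indexOf(';');
        return (semi >= 0 ? value.substring(0, semi) : value).trim();
    }

    /** Walks the dumped message lines and returns the PDF part, or null if none is found. */
    static Attachment extract(List<String> lines) {
        // Find the first Content-Type line that carries a boundary string.
        String boundary = null;
        int i = 0;
        for (; i < lines.size() && boundary == null; i++) {
            if (startsWithIgnoreCase(lines.get(i), "Content-Type:")) {
                boundary = headerParam(lines.get(i), "boundary");
            }
        }
        if (boundary == null) {
            return null;
        }

        // Find the next Content-Type line that also mentions the PDF token.
        String fileName = null;
        for (; i < lines.size(); i++) {
            String line = lines.get(i);
            if (startsWithIgnoreCase(line, "Content-Type:")
                    && line.toLowerCase(Locale.ROOT).contains(PDF_TOKEN)) {
                fileName = headerParam(line, "name");
                break;
            }
        }
        if (i >= lines.size()) {
            return null;
        }

        // Step over the rest of the body part header; it ends at the first blank line.
        while (i < lines.size() && !lines.get(i).isEmpty()) {
            i++;
        }

        // Collect the encoded lines up to the next boundary marker.
        StringBuilder encoded = new StringBuilder();
        for (i++; i < lines.size() && !lines.get(i).startsWith("--" + boundary); i++) {
            encoded.append(lines.get(i)).append('\n');
        }

        // The MIME decoder ignores the line breaks between the base64 lines.
        byte[] bytes = Base64.getMimeDecoder().decode(encoded.toString());
        return new Attachment(fileName, bytes);
    }
}
```

In a version built on temporary files, the byte array would be written out to a file named after Attachment.fileName, and the decode step would go through MimeUtility.decode on a stream over the second temporary file, as described in the last step above.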
The actual code is fairly horrid, especially where you have to carefully step over the body part content header to get to the encoded PDF. It does work, however, and is worth the pain. Since most of the function calls in this process can throw exceptions, you'll want to be careful about keeping variables in the correct scope and knowing when to enclose the code in