Conversion of PDF to XML/XSL - Java?

How to convert a PDF to XML and capture its structure/styling in XSL?

Answers


I once described PDF-to-XML conversion as trying to convert hamburgers into cows. It's an exercise in reverse engineering. PDF is very variable in the way it represents text; in the worst case, all you have is a scanned image (in which case you are essentially doing OCR). If you're lucky, you have a collection of strings of text with the coordinates of where they appear on the page, but no other indication of structure.
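To make the "strings plus coordinates" case concrete, here is a minimal Java sketch of the first reverse-engineering step: grouping positioned fragments back into lines. The `Fragment` type, the `toLines` helper, and the tolerance value are all hypothetical and not taken from any particular PDF library. It also assumes the default PDF coordinate system, where y increases up the page.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LineGrouper {
    /** A positioned string, as an extractor might report it (hypothetical type). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments into lines: fragments whose y is within `tolerance`
     * of the first fragment of the current line are treated as one line.
     * Lines are returned top to bottom, fragments left to right.
     */
    static List<String> toLines(List<Fragment> fragments, double tolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        // Assumption: larger y means higher on the page, so sort descending.
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y).reversed());

        List<List<Fragment>> groups = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = groups.isEmpty() ? null : groups.get(groups.size() - 1);
            if (last != null && Math.abs(last.get(0).y - f.y) <= tolerance) {
                last.add(f);
            } else {
                List<Fragment> group = new ArrayList<>();
                group.add(f);
                groups.add(group);
            }
        }

        List<String> lines = new ArrayList<>();
        for (List<Fragment> group : groups) {
            group.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : group) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("world", 100, 700.5));
        page.add(new Fragment("Hello", 50, 700));
        page.add(new Fragment("Second", 50, 680));
        toLines(page, 2.0).forEach(System.out::println);
    }
}
```

Even this only recovers lines. Paragraphs, columns, headings and tables all need further heuristics, which is why the conversion is so fragile.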

There are tools that do a reasonable job (usually producing Microsoft Word) if the PDF is in a form that they understand. Google "PDF to Word conversion". Try them out (it's been a while since I did so); don't try to write your own. From Word, of course, getting to XML is "relatively" straightforward.


PDFTextStream can readily extract text from PDF documents as XML. It ships with one particular PDF-to-XML implementation, XMLOutputTarget, and the source is included so you can easily tweak it to suit your requirements.

Code samples are available to get started, or you can read more in-depth about how PDF text extraction with PDFTextStream works.

(Disclosure: I am employed by Snowtide, the makers of PDFTextStream. I hope this pointer is helpful in any case.)


I think Michael Kay nailed it when he described PDF -> XML conversion as 'trying to convert hamburgers into cows'.

I've done quite a bit of PDF to XML conversion in the past. I've been lucky in that I've had decent PDFs to convert that didn't require OCR. Most of my issues were around tables and graphics. Converting to Word first, as Michael suggests, may help with those.

What I did was convert the PDF to text using pdftotext from Xpdf and then convert the text to XML. (I used Omnimark for the text -> XML conversion, but you could probably use Java or Python to do the conversion.) It might be easiest to convert to a basic structure and then use XSLT (2.0!) to fine-tune it.
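As a rough illustration of the "basic structure" step in Java, the sketch below wraps each blank-line-separated block of extracted text in an element, using only the JDK's built-in DOM and transformer classes. The `document` and `para` element names and the `TextToBasicXml` class are made up for the example. It assumes that blank lines (and form feeds, if the extractor emits them between pages) mark paragraph boundaries, which will not hold for every PDF.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class TextToBasicXml {
    /** Builds <document><para>...</para>...</document> from plain text. */
    static Document toDocument(String text) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.createElement("document");
        doc.appendChild(root);

        // Assumption: a blank line or a form feed ends a paragraph.
        String normalized = text.replace("\r\n", "\n");
        for (String block : normalized.split("\\n\\s*\\n|\\f")) {
            // Re-join hard-wrapped lines into a single run of text.
            String content = block.trim().replaceAll("\\s+", " ");
            if (content.isEmpty()) {
                continue;
            }
            Element para = doc.createElement("para");
            para.setTextContent(content);
            root.appendChild(para);
        }
        return doc;
    }

    /** Serializes the DOM; the serializer escapes characters such as '<'. */
    static String serialize(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.INDENT, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String text = "Title line\n\nFirst para\nwrapped.\n\n\nA < B";
        System.out.println(serialize(toDocument(text)));
    }
}
```

The output is deliberately flat. Deciding which `para` is really a heading, list item or table row is the part that would be left to a follow-up XSLT pass.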

