I was looking into the technical specification of PDF. I believe text can be stored in a PDF in two ways: a) as a text layer above the image layer (as explained in the website above); b) when you create a PDF from a Word document (with text), where I do not believe Word stores all the text in a text layer.
Since PDF 1.4, XMP has been included (http://en.wikipedia.org/wiki/Extensible_Metadata_Platform). What is XMP? Is this the “text layer” I described above?
If a scanner performs OCR on an image, does it store the text in the “text layer” or in the “XMP” field? Is this only possible when the PDF is version 1.4?
And how can I detect whether a PDF already contains text data? For example: PDF A has been scanned with OCR and PDF B has not. How can I tell that PDF B should be sent to a different OCR engine?
Usually, after OCR the text is added in ‘invisible’ text rendering mode to the normal content of the PDF (not as an additional layer that is made invisible, which is also technically possible in PDF; look for Optional Content in the PDF specification). However, in real-life PDFs (both ‘scanned’ and ‘regular’ ones), you will often find that you can select the text and copy it, but after pasting you only get gobbledygook. The same happens if you run pdftotext on such a file. If so, the problem lies with the encoding of the font used.
The PDF specification makes no mention of a ‘text layer’. Basically, there is only one way to ‘store’ text: by means of text-showing operators. These operators draw text at a specific position, using a specific color, font, font size and text rendering mode. There are several text rendering modes. For the purpose of answering your question, text can be visible or invisible.
In a scanned PDF with OCR, the text is rendered using the invisible text rendering mode. The result is that you can select the text with the mouse (the highlighted area is shown at the expected place on top of the image) and you can search for text.
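As a rough illustration of the text-showing operators and the rendering mode, here is a minimal Python sketch that counts visible and invisible text-showing operators in a page content stream. It assumes the stream has already been decoded to plain text (in real files it is usually compressed), and the function name summarize_text_ops is made up for this example. It is deliberately simplified: nested parentheses inside strings and the q/Q graphics-state save/restore are ignored, and only mode 3 is treated as invisible.

```python
import re

# Text-showing operators of the PDF content stream syntax: Tj, TJ, ' and ".
TEXT_SHOW_OPS = {"Tj", "TJ", "'", '"'}

# Literal strings "(...)" (with backslash escapes, no nesting) and hex strings "<...>".
STRING_RE = re.compile(r"\((?:\\.|[^\\()])*\)|<[0-9A-Fa-f\s]*>")


def summarize_text_ops(content):
    """Count text-showing operators, split by whether rendering mode 3 is active.

    `content` is an already-decoded content stream given as a str.
    """
    # Blank out string operands so their contents are not mistaken for operators.
    stripped = STRING_RE.sub(" ", content)
    # Put spaces around array brackets so "]TJ" splits into "]" and "TJ".
    stripped = re.sub(r"([\[\]])", r" \1 ", stripped)

    counts = {"visible": 0, "invisible": 0}
    mode = 0  # default text rendering mode: fill the glyphs
    previous = ""
    for token in stripped.split():
        if token == "Tr" and previous.isdigit():
            mode = int(previous)  # "3 Tr" selects the invisible mode
        elif token in TEXT_SHOW_OPS:
            counts["invisible" if mode == 3 else "visible"] += 1
        previous = token
    return counts


if __name__ == "__main__":
    sample = (
        "q 612 0 0 792 0 0 cm /Im0 Do Q\n"                       # the scanned image
        "BT 3 Tr /F1 12 Tf 72 700 Td (Hello OCR) Tj ET\n"        # invisible OCR text
        "BT 0 Tr /F1 12 Tf 72 650 Td [(Vis) -20 (ible)] TJ ET\n" # visible text
    )
    print(summarize_text_ops(sample))
```

A page whose counts are both zero contains no text-showing operators at all, which makes it a candidate for OCR. A page with only invisible text on top of an image has most likely been through OCR already.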
What exactly happens when you produce a PDF from a Word document depends on the software you use for the conversion. To my knowledge, these converters do not generate an image; they generate visible text.
XMP is metadata rather than visual data, so it is not the ‘text layer’ you are asking about.
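To see what XMP actually holds, a small Python sketch like the one below can pull XMP packets out of the raw bytes of a file. It relies on the packet wrapper markers (<?xpacket begin=...?> and <?xpacket end=...?>) and assumes the metadata stream is stored uncompressed and unencrypted, which is common but not guaranteed. The function name find_xmp_packets is invented for this example.

```python
import re

# An XMP packet is wrapped between "<?xpacket begin=...?>" and "<?xpacket end=...?>".
XMP_PACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>",
    re.DOTALL,
)


def find_xmp_packets(pdf_bytes):
    """Return the XML body of every XMP packet found in the raw file bytes.

    Assumption: the metadata stream is neither compressed nor encrypted;
    otherwise nothing is found and a real PDF parser is needed.
    """
    return [
        match.group(1).decode("utf-8", errors="replace").strip()
        for match in XMP_PACKET.finditer(pdf_bytes)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python find_xmp.py some.pdf
    with open(sys.argv[1], "rb") as handle:
        for packet in find_xmp_packets(handle.read()):
            print(packet)
```

What comes out is XML describing the document (title, author, creation tool and so on), not the words on the page, so it does not tell you whether OCR text is present.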
I have a Python script that converts a PDF file to a text file. The script asks the user for the path of the folder that contains the PDF files.
The problem is that the script only converts one file; what I need is for the script to convert all the PDF files in the specified directory.
Apart from never incrementing the variable i in the while loop, you are also using the same variable name i in the for loop. So after the for loop exits, the value of i has already changed. You should use different variable names in the while loop and the for loop.
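Since the original script is not shown here, the following is only a sketch of one way to avoid the counter problem altogether: iterate over the files directly instead of maintaining an index. The names convert_folder and convert_one are hypothetical; convert_one stands in for whatever function in your script converts a single PDF to text.

```python
from pathlib import Path


def convert_folder(folder, convert_one):
    """Call convert_one(pdf_path, txt_path) for every PDF directly inside folder.

    convert_one is a placeholder for the existing single-file conversion code.
    Returns the list of output paths in the order they were processed.
    """
    converted = []
    # One loop variable, no manual counter: each PDF is visited exactly once.
    # Note that "*.pdf" is matched case-sensitively on most Linux systems.
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        convert_one(pdf_path, txt_path)
        converted.append(txt_path)
    return converted


if __name__ == "__main__":
    folder = input("Path of the folder containing the PDF files: ")
    # Stand-in converter that only reports what it would do.
    convert_folder(folder, lambda src, dst: print(f"{src} -> {dst}"))
```

If you prefer to keep an index, the point of the answer still applies: give the outer and inner loops different variable names and make sure the while loop's counter is actually incremented.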