How can I convert PDF, PPT, XLS, and DOC files to TXT/HTML? Are there any open-source tools or code available in PHP/Python/Perl?

My end objective is to index documents using Lucene. Since Lucene doesn't index these formats directly, I want to convert the files to TXT/HTML, which Lucene can index. I have a set of almost 1,000 documents (PPT, PDF, DOC, XLS, etc.). Please help.


You could run OpenOffice in headless mode to convert the files from one format to another, say Excel/Doc to TXT/HTML.

We use a similar process combined with ImageMagick to allow people to upload office documents into a presentation app.
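As a rough illustration of the headless conversion described above, here is a minimal Python sketch. It assumes a LibreOffice-style `soffice` binary on your PATH that accepts the `--headless`, `--convert-to` and `--outdir` options; older OpenOffice builds may need one of the converter tools instead. The function names, the default extension list, and the `html` target are my own choices for the example, and the exact filter name you need for plain text may differ by version. PDF is left out of the defaults because I'm not certain this route handles PDF-to-text well.

```python
import shutil
import subprocess
from pathlib import Path

# Assumption: a LibreOffice-style "soffice" executable is on PATH.
SOFFICE = "soffice"


def build_convert_command(src, out_dir, target="html", soffice=SOFFICE):
    """Return the argument list for one headless conversion (hypothetical helper)."""
    return [
        soffice,
        "--headless",            # run without a GUI
        "--convert-to", target,  # output format, e.g. "html" or "txt"
        "--outdir", str(out_dir),
        str(src),
    ]


def convert_all(in_dir, out_dir, target="html",
                extensions=(".doc", ".xls", ".ppt")):
    """Convert every matching file in in_dir and return the sources attempted."""
    if shutil.which(SOFFICE) is None:
        raise RuntimeError("soffice not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    attempted = []
    for src in sorted(Path(in_dir).iterdir()):
        if src.suffix.lower() in extensions:
            # check=True raises CalledProcessError if the converter fails.
            subprocess.run(build_convert_command(src, out_dir, target), check=True)
            attempted.append(src)
    return attempted
```

Calling `convert_all("docs", "converted")` would then try each .doc/.xls/.ppt file in `docs` and write the results to `converted`, ready to feed into your Lucene indexer. Converting one file per process is slow but simple, which should be fine for around 1,000 documents.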

Below are a few examples/tutorials on how to achieve this:

Set up OpenOffice

JODConverter (Java)

PyODConverter (Python)

If you need any further help with OOo, feel free to ask.

Good luck :)

