How can I convert PDF, PPT, XLS, and DOC files to TXT/HTML files? Are there any open-source tools or code in PHP/Python/Perl available?

My end objective is to index documents using Lucene. Since Lucene doesn't support indexing these formats directly, I want to convert the files to TXT/HTML (file types Lucene can index). I have a set of almost 1,000 documents in PPT, PDF, DOC, XLS, and similar formats. Please help.

Answers


You could use OpenOffice in headless mode to convert the files from one format to another, say Excel/DOC to TXT/HTML.

We use a similar process combined with ImageMagick to allow people to upload office documents into a presentation app.

Below are a few examples/tutorials on how to achieve this:

Setup OpenOffice

http://code.google.com/p/openmeetings/wiki/OpenOfficeConverter

JOD Converter (Java)

http://artofsolving.com/opensource/jodconverter

PyOD Converter (Python)

http://artofsolving.com/opensource/pyodconverter
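
As a rough illustration of the headless approach, here is a minimal Python sketch for batch-converting a folder of office files. It assumes a LibreOffice-style `soffice` binary on the PATH that accepts `--headless --convert-to` (older OpenOffice builds may not, which is where JODConverter/PyODConverter come in). The function names, the extension list, and the folder layout are made up for this example, and PDF input is left out because this sketch only covers office formats.

```python
import subprocess
from pathlib import Path

# Extensions treated as office documents here (an assumption; adjust as needed).
OFFICE_EXTENSIONS = {".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx"}


def build_convert_command(src, out_dir, target="html", soffice="soffice"):
    """Return the argument list for one headless conversion.

    Assumes a LibreOffice-style binary that understands --convert-to.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", target,
        "--outdir", str(out_dir),
        str(src),
    ]


def find_office_files(root):
    """Return all office documents below root, sorted by path."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in OFFICE_EXTENSIONS
    )


def convert_all(root, out_dir, target="html"):
    """Convert every office file under root; return the files that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    failed = []
    for src in find_office_files(root):
        result = subprocess.run(
            build_convert_command(src, out_dir, target),
            capture_output=True,
            text=True,
        )
        # A non-zero exit code is only a coarse failure signal.
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Calling `convert_all("docs", "converted")` would write one HTML file per input into `converted`, and passing `target="txt:Text"` should give plain text instead. Files with the same base name in different subfolders would overwrite each other in the output folder, so a real run over 1,000 documents may want per-folder output directories and a check that each output file actually exists.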

If you need any further help with OOo, feel free to ask.

Good luck :)

