What do I use to download all PDFs from a website?
I need to download all the PDF files present on a site. Trouble is, they aren't listed on any one page, so I need something (a program? a framework?) to crawl the site and download the files, or at least get a list of them. I tried WinHTTrack, but I couldn't get it to work, and DownThemAll for Firefox does not crawl multiple pages or entire sites. I can't possibly be the first person to run into this problem, so I'm sure a solution exists. What would you recommend?
wget -r -A pdf http://www.site.com

(-r crawls the site recursively; -A pdf keeps only files matching *.pdf and deletes everything else after download. You may also want -np so wget doesn't ascend into the parent directory, and -P <dir> to choose where the files are saved.)
Google has an operator to return only files of a certain type (filetype:pdf). Combine this with the site: operator and you have your "crawler": search for filetype:pdf site:www.site.com.
Use a web-crawling library, e.g. in Ruby: http://www.example-code.com/ruby/spider_begin.asp
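If you'd rather not pull in a library, a small same-site crawler is not much code. Here is a minimal sketch in Python using only the standard library; the names (`LinkParser`, `extract_links`, `crawl`) and the breadth-first strategy are my own choices, not anything the site requires, and a real run needs polite rate limiting and robots.txt handling on top of this.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag it sees."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html, base_url):
    """Return absolute URLs for all links found in an HTML page."""
    parser = LinkParser()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.links]

def crawl(seed_url, max_pages=50):
    """Breadth-first crawl within one host; return the set of PDF URLs seen."""
    host = urlparse(seed_url).netloc
    queue, seen, pdfs = [seed_url], {seed_url}, set()
    while queue and len(seen) <= max_pages:
        url = queue.pop(0)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue  # skip pages that fail to load
        for link in extract_links(html, url):
            if link.lower().endswith(".pdf"):
                pdfs.add(link)          # collect, don't follow
            elif urlparse(link).netloc == host and link not in seen:
                seen.add(link)          # stay on the same host
                queue.append(link)
    return pdfs
```

Once `crawl("http://www.site.com")` has produced the list, downloading the files themselves is a loop over `urlopen` (or a single `wget -i list.txt`).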
If there are no links to the PDF files, a crawler won't help and you basically have only two options:
- Get the list from somewhere else (e.g. ask the site's webmaster for one).
- Read the website's directory listing. If directory listing is disabled on their web server, though, this won't work.
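If the directory listing is enabled, the index page the server generates is itself just HTML, so extracting the PDF names is a one-liner. A sketch, assuming an Apache-style "Index of /files" page; `list_pdfs` and the regex are illustrative, and real index pages vary in markup:

```python
import re

def list_pdfs(index_html):
    """Return href values ending in .pdf from a directory-index page."""
    return re.findall(r'href="([^"]+\.pdf)"', index_html, flags=re.IGNORECASE)

# Usage (requires network access):
#   from urllib.request import urlopen
#   html = urlopen("http://www.site.com/files/").read().decode("utf-8", "replace")
#   print(list_pdfs(html))
```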