What do I use to download all PDFs from a website?

I need to download all the PDF files present on a site. Trouble is, they aren't listed on any one page, so I need something (a program? a framework?) to crawl the site and download the files, or at least get a list of the files. I tried WinHTTrack, but I couldn't get it to work. DownThemAll for Firefox does not crawl multiple pages or entire sites. I know that there is a solution out there, as I couldn't have possibly been the first person to be presented with this problem. What would you recommend?

Answers


From http://www.go2linux.org/tips-and-tricks-of-wget-to-download-files:

wget -r -A pdf http://www.site.com
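If that pulls in too much or misses deeper pages, wget's standard options can be tuned. A sketch (the URL is a placeholder; adjust the depth and output directory to the actual site):

wget -r -l 10 -np -A pdf -P ./pdfs --wait=1 http://www.site.com

Here -l sets the recursion depth, -np keeps wget from climbing to parent directories, -P chooses where the files land, and --wait adds a polite pause between requests.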

Google can restrict search results to files of a given type (filetype:). Combine this with the site: operator and you have your "crawler".

Example: http://www.google.com/search?q=site:soliddocuments.com+filetype:pdf


Use a web-crawling library, e.g. in Ruby: http://www.example-code.com/ruby/spider_begin.asp
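If you'd rather script it yourself, here is a minimal sketch of the same idea in Python using only the standard library. It assumes the PDFs are reachable through ordinary <a href> links on the site, and the start URL is a placeholder:

# Minimal same-site crawler that collects links to PDF files (sketch only).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

START_URL = "http://www.site.com/"  # placeholder: replace with the real site

class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url):
    site = urlparse(start_url).netloc
    to_visit = [start_url]
    seen = set()
    pdfs = set()
    while to_visit:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urlopen(url) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except Exception:
            continue  # skip pages that fail to load
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc != site:
                continue  # stay on the same site
            if absolute.lower().endswith(".pdf"):
                pdfs.add(absolute)
            else:
                to_visit.append(absolute)
    return pdfs

if __name__ == "__main__":
    for pdf in sorted(crawl(START_URL)):
        print(pdf)

Running it prints the PDF URLs it finds; you can then feed that list to wget or DownThemAll to do the actual downloading.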


If there are no links to PDF files, a crawler won't help and you basically only have two choices:

  1. Get the list from somewhere else (ask the site's webmaster for a list).
  2. Get the list from the website's directory listing. If directory listing is disabled on the web server, though, you won't be able to use this.
