How to avoid robot detection?
I'm using python+mechanize, attempting to scrape a site. If I visit this site with Links, a text-only version of the login page appears. That's what I'd like my scraper to see. So:
    import mechanize

    USER_AGENT = "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)"

    mech = mechanize.Browser(factory=mechanize.RobustFactory())
    mech.addheaders = [('User-agent', USER_AGENT)]
    mech.set_handle_robots(False)

    resp = mech.open(URLS['start'])
    fnout("001-login.html", resp.read())
    resp.close()
fnout just dumps the string to a file. Yet, when I open 001-login.html, the entirety of the page is the word "Robot". Nothing else.
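(For context, fnout is nothing fancier than something along these lines; the real helper isn't shown here:)

    def fnout(filename, data):
        # Write the raw response body to a file so it can be inspected.
        with open(filename, "w") as f:
            f.write(data)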
I haven't made any other requests. It's not as if I loaded the page and skipped the images, or anything like that. This was the first request I made, and I set the User-Agent to exactly the string sent by the version of Links that works with the site. What am I doing wrong (besides trying to scrape a site that doesn't want to be scraped, that is)?
There are probably other headers that Links is sending that mechanize is not, or vice versa. Hit up http://www.reliply.org/tools/requestheaders.php with both Links and mechanize and see what headers each is sending.
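For example, you can turn on mechanize's HTTP debug logging to see exactly what it puts on the wire, then pad out addheaders with whatever Links sends that mechanize doesn't. This is just a sketch: the extra header values below are placeholders, so copy the real ones from the reliply.org dump for your copy of Links.

    import sys
    import logging

    import mechanize

    USER_AGENT = "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)"

    mech = mechanize.Browser(factory=mechanize.RobustFactory())
    mech.set_handle_robots(False)

    # Log the raw request/response headers mechanize sends and receives.
    logger = logging.getLogger("mechanize")
    logger.addHandler(logging.StreamHandler(sys.stdout))
    logger.setLevel(logging.INFO)
    mech.set_debug_http(True)

    # Mimic the full header set a text browser like Links sends.
    # These values are placeholders -- use whatever the reliply.org
    # page reports when you visit it with Links.
    mech.addheaders = [
        ('User-agent', USER_AGENT),
        ('Accept', '*/*'),
        ('Accept-Language', 'en'),
    ]

    resp = mech.open("http://www.reliply.org/tools/requestheaders.php")
    print(resp.read())

Comparing that debug output against the headers reported for Links should show which header (or missing header) is tripping the site's robot check.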