How to avoid robot detection?

I'm using Python + mechanize, attempting to scrape a site. If I visit this site with Links (the text-mode browser), a text-only version of the login page appears. That's what I'd like my scraper to see. So:

import mechanize

USER_AGENT = "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)"
mech = mechanize.Browser(factory=mechanize.RobustFactory())
mech.addheaders = [('User-agent', USER_AGENT)]
mech.set_handle_robots(False)

resp = mech.open(URLS['start'])
fnout("001-login.html", resp.read())
resp.close()

fnout just dumps the string to a file. Yet, when I open 001-login.html, the entirety of the page is the word "Robot". Nothing else.

I haven't made any other requests. It's not as if I loaded the page and skipped the images, or anything like that. This was the very first request I made, and I set the User-Agent to exactly what the version of Links that works with the site reports. What am I doing wrong (besides trying to scrape a site that doesn't want to be scraped, that is)?

Answers


There are probably other headers that Links is sending and mechanize is not, or vice versa. Hit http://www.reliply.org/tools/requestheaders.php with both Links and mechanize and compare the headers each one sends.
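If you'd rather not rely on an external service, you can run a tiny local header echo and point each client at it. A minimal sketch using only the standard library (the spoofed User-Agent string is just the one from the question; everything else here is illustrative):

```python
import http.server
import threading
import urllib.request

# Records the exact headers each client sends, so you can diff
# Links vs. mechanize request-by-request.
captured = {}

class EchoHandler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        captured.update(self.headers.items())
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        # Silence per-request logging.
        pass

# Bind to an ephemeral port and serve in the background.
server = http.server.HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Point any client at this URL; here, urllib with the Links UA string.
url = "http://127.0.0.1:%d/" % server.server_port
req = urllib.request.Request(url, headers={
    "User-Agent": "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)",
})
urllib.request.urlopen(req).read()
server.shutdown()

print(sorted(captured.items()))
```

Run the same capture once driven by Links and once by your mechanize script, and any header present in one dump but missing from the other (Accept, Accept-Language, Accept-Charset, and so on) is a candidate for what the robot check is keying on.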

