How to avoid robot detection?

I'm using Python + mechanize, attempting to scrape a site. If I visit this site with the links browser, a text-only version of the login page appears. That's what I'd like my scraper to see. So:

import mechanize

USER_AGENT = "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)"
mech = mechanize.Browser(factory=mechanize.RobustFactory())
mech.addheaders = [('User-agent', USER_AGENT)]

resp = mech.open(urls['start'])  # urls['start'] holds the login page URL
fnout('001-login.html', resp.read())

fnout just dumps the string to a file. Yet, when I open 001-login.html, the entirety of the page is the word "Robot". Nothing else.

I haven't made any other requests; it's not as if I loaded the page but skipped the images, or anything like that. This was the first request I made, and I set the User-Agent to exactly what the version of links that the site worked with reports. What am I doing wrong (besides trying to scrape a site that doesn't want to be scraped, that is)?


Probably there are other headers that links is sending and Mechanize is not, or vice versa. Make the same request with both links and Mechanize against a server that echoes the request headers back, and compare what each client actually sends.
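One way to do that without relying on an external echo service is to run a tiny local server from the Python standard library that records whatever headers it receives. Point links at it (`links http://127.0.0.1:PORT/`), then run your Mechanize script against the same URL, and diff the two dumps. This sketch uses `urllib.request` as the client just to show the round trip; the handler and variable names are mine, not from the original post:

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

captured = {}  # header name -> value, as received from the client

class EchoHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Record every header the client actually sent.
        captured.update(self.headers.items())
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the console quiet

# Bind to an ephemeral port and serve in the background.
server = HTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = "http://127.0.0.1:%d/" % server.server_address[1]
req = urllib.request.Request(url, headers={
    "User-Agent": "Links (2.3pre1; Linux 2.6.32-5-xen-amd64 x86_64; 80x24)",
})
urllib.request.urlopen(req).read()
server.shutdown()

for name, value in sorted(captured.items()):
    print("%s: %s" % (name, value))
```

Any header present in the links dump but missing from the Mechanize one (Accept, Accept-Language, and so on) is a candidate to add to `mech.addheaders`.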
