How to get an HTML file using Python?

I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page:

How do I retrieve the page? My two main concerns are which functions to use and how to filter out the useless links on the page.


Example using urllib and lxml.html:

import urllib
from lxml import html

url = ""
page = html.fromstring(urllib.urlopen(url).read())

for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")

output >>
    [('Aathma Liyanage', 'athma.html'),
     ('Abewardhana Balasuriya', 'abewardhana.html'),
     ('Aelian Thilakeratne', 'aelian_thi.html'),
     ('Ahamed Mohideen', 'ahamed.html'),

I think eyquem's approach would be my choice too, but I prefer to use httplib2 instead of urllib. urllib2 is too low-level a library for this job.

import httplib2, re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
http = httplib2.Http()
headers, body = http.request("")
li = pat.findall(body)
print li

  1. Use urllib2 to get the page.

  2. Use BeautifulSoup to parse the HTML (the page) and get what you want!
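
The two steps above can be sketched roughly as follows. To avoid third-party dependencies, this sketch uses the standard library's html.parser.HTMLParser where the answer suggests BeautifulSoup, and urllib.request (the Python 3 successor to urllib2) for the download. The names DtLinkParser, extract_artists and fetch are illustrative. The rule "keep only links that directly follow a <dt> tag" is an assumption borrowed from the regex used in the other answers, not something verified against the actual page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class DtLinkParser(HTMLParser):
    """Collect (text, href) pairs for <a> tags that directly follow a <dt>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._after_dt = False   # True when the previous start tag was <dt>
        self._href = None        # href of the link currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <DT> arrives as "dt".
        if tag == "a" and self._after_dt:
            self._href = dict(attrs).get("href")
            self._text = []
        self._after_dt = (tag == "dt")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


def extract_artists(html_text):
    """Return the (name, href) pairs found in an HTML string."""
    parser = DtLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.links


def fetch(url):
    """Step 1: download a page and decode it to text."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling extract_artists(fetch(url)) with the page's address would then give a list of (name, href) tuples. Navigation links elsewhere on the page are skipped because they do not follow a <dt>.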

Check this out, my friend:

import re
import urllib.request

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = ''
sock = urllib.request.urlopen(url).read().decode("utf-8")
li = pat.findall(sock)


Or, more straightforwardly:

import urllib
import re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = ''
sock = urllib.urlopen(url)
li = pat.findall(sock.read())  # read the response body before matching
sock.close()

print li

And respect robots.txt and throttle your requests :)

(Apparently urllib2 does already according to this helpful SO post).
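
As a rough illustration of both points, the sketch below checks robots.txt explicitly with the standard library's urllib.robotparser and pauses between requests. The function names load_robots, download and polite_fetch, the one-second default delay, and the "*" user agent are assumptions made for the example, not part of the original advice.

```python
import time
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser


def load_robots(robots_url):
    """Download and parse a site's robots.txt."""
    rules = RobotFileParser()
    rules.set_url(robots_url)
    rules.read()
    return rules


def download(url):
    """Fetch the raw bytes of a page."""
    with urlopen(url) as response:
        return response.read()


def polite_fetch(urls, rules, user_agent="*", delay=1.0, fetch=download):
    """Yield (url, body) for each URL robots.txt allows, sleeping between requests."""
    first = True
    for url in urls:
        if not rules.can_fetch(user_agent, url):
            continue  # skip anything the site asks crawlers not to fetch
        if not first:
            time.sleep(delay)  # throttle: wait before every request after the first
        first = False
        yield url, fetch(url)
```

The fetch parameter lets the loop be tried against a stub function before pointing it at a real site.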

Basically, there's a function call:


