How to get an HTML file using Python?

I am not very familiar with Python. I am trying to extract the artist names (for a start :)) from the following page: http://www.infolanka.com/miyuru_gee/art/art.html.

How do I retrieve the page? My two main concerns are; what functions to use and how to filter out useless links from the page?

Answers


Example using urlib and lxml.html:

import urllib
from lxml import html

url = "http://www.infolanka.com/miyuru_gee/art/art.html"
page = html.fromstring(urllib.urlopen(url).read())

for link in page.xpath("//a"):
    print "Name", link.text, "URL", link.get("href")

output >>
    [('Aathma Liyanage', 'athma.html'),
     ('Abewardhana Balasuriya', 'abewardhana.html'),
     ('Aelian Thilakeratne', 'aelian_thi.html'),
     ('Ahamed Mohideen', 'ahamed.html'),
    ]

I think "eyquem" way would be my choice too, but I like to use httplib2 instead of urllib. urllib2 is too low level lib for this work.

import httplib2, re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')
http = httplib2.Http()
headers, body = http.request("http://www.infolanka.com/miyuru_gee/art/art.html")
li = pat.findall(body)
print li


  1. Use urllib2 to get the page.

  2. Use BeautifulSoup to parse the HTML (the page) and get what you want!


Check this my friend

import urllib.request

import re

pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'

sock = urllib.request.urlopen(url).read().decode("utf-8")

li = pat.findall(sock)

print(li)

Or go straight forward:

import urllib

import re
pat = re.compile('<DT><a href="[^"]+">(.+?)</a>')

url = 'http://www.infolanka.com/miyuru_gee/art/art.html'
sock = urllib.urlopen(url)
li = pat.findall(sock.read())
sock.close()

print li

And respect robots.txt and throttle your requests :)

(Apparently urllib2 does already according to this helpful SO post).


Basically, there's a function call:

render_template()

You can easly return single page or list of pages with it and it reads all files automaticaly from a your_workspace\templates .

Example:

/root_dir /templates /index1.html, /index2.html /other_dir /

routes.py

@app.route('/') def root_dir(): return render_template('index1.html')

@app.route(/<username>) def root_dir_with_params(username): retun render_template('index2.html', user=username)

index1.html - without params

<html> <body> <h1>Hello guest!</h1> <button id="getData">Get Data!</button> </body> </html>

index2.html - with params

<html> <body> <!-- Built-it conditional functions in the framework templates in Flask --> {% if name %} <h1 style="color: red;">Hello {{ user }}!</h1> {% else %} <h1>Hello guest.</1> <button id="getData">Get Data!</button> </body> </html>


Need Your Help

Vectorizing a matrix

r vector statistics matrix

I have a large 2D matrix that is 1000 x 1000. I want to reshape this so that it is one column (or row). For example, if the matrix was:

javascript refresh prevents cgi execution

javascript iframe cgi refresh

I am trying to refresh the page after a cgi execution. In my cgi, a tcp packet is sent to a port. it is called in a hidden iframe. Here is my javascript code: