Decoding HTML entities with Python
I'm trying to decode HTML entries from here NYTimes.com and I cannot figure out what I am doing wrong.
Take for example:
"U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’"
I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success.
import re def _callback(matches): id = matches.group(1) try: return unichr(int(id)) except: return id def decode_unicode_references(data): return re.sub("&#(\d+)(;|(?=\s))", _callback, data) data = "U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’" print decode_unicode_references(data)
Actually what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example     all mean U+00A0 NO-BREAK SPACE.
(the type you have) is a "numeric character reference" (decimal).   is a "numeric character reference" (hexadecimal). is an entity.
Further reading: http://htmlhelp.com/reference/html40/entities/
Here you will find code for Python2.x that does all three in one scan through the input: http://effbot.org/zone/re-sub.htm#unescape-html
This does work:
from BeautifulSoup import BeautifulStoneSoup s = "U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’" decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)
If you want a string instead of a Unicode object, you'll need to decode it to an encoding that supports the characters being used; ISO-8859-1 doesn't:
result = decoded.encode("UTF-8")
It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)
>>> from HTMLParser import HTMLParser >>> print HTMLParser().unescape('U.S. Adviser’s Blunt Memo on Iraq: ' ... 'Time ‘to Go Home’') U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’