Decoding HTML entities with Python

I'm trying to decode HTML entries from here and I cannot figure out what I am doing wrong.

Take for example:

"U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’"

I've tried BeautifulSoup, decode('iso-8859-1'), and django.utils.encoding's smart_str without any success.


Try this:

import re

def _callback(matches):
    id =
        return unichr(int(id))
        return id

def decode_unicode_references(data):
    return re.sub("&#(\d+)(;|(?=\s))", _callback, data)

data = "U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’"
print decode_unicode_references(data)

Actually what you have are not HTML entities. There are THREE varieties of those &.....; thingies -- for example       all mean U+00A0 NO-BREAK SPACE.

  (the type you have) is a "numeric character reference" (decimal).   is a "numeric character reference" (hexadecimal).   is an entity.

Further reading:

Here you will find code for Python2.x that does all three in one scan through the input:

This does work:

from BeautifulSoup import BeautifulStoneSoup
s = "U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’"
decoded = BeautifulStoneSoup(s, convertEntities=BeautifulStoneSoup.HTML_ENTITIES)

If you want a string instead of a Unicode object, you'll need to decode it to an encoding that supports the characters being used; ISO-8859-1 doesn't:

result = decoded.encode("UTF-8")

It's unfortunate that you need an external module for something like this; simple HTML/XML entity decoding should be in the standard library, and not require me to use a library with meaningless class names like "BeautifulStoneSoup". (Class and function names should not be "creative", they should be meaningful.)

>>> from HTMLParser import HTMLParser
>>> print HTMLParser().unescape('U.S. Adviser’s Blunt Memo on Iraq: '
...                             'Time ‘to Go Home’')
U.S. Adviser’s Blunt Memo on Iraq: Time ‘to Go Home’

The function is undocumented in Python 2. It is fixed in Python 3.4+: it is exposed as html.unescape() there.

Need Your Help

Keyboard shortcut to paste clipboard content into command prompt window (Win XP)

windows keyboard-shortcuts

Is there a keyboard shortcut for pasting the content of the clipboard into a command prompt window on Windows XP (instead of using the right mouse button)?

Mixing Objective C ,(*.m , *.mm & .c /.cpp ) files

c++ objective-c cocoa objective-c++

In my project Core libraries are part of C/C++ files, while UI needs to be developed in Objective C,