How to know what's the right encoding?

I've decided to learn C++ and I really like the site www.learncpp.com. Now I would like to make a PDF version of it and print it, so that I can read it on paper. First I built a URL collector for all the chapters on the site. It works fine.

Now I'm working on creating an HTML file out of the first chapter. I wrote the following:

import requests
from bs4 import BeautifulSoup
import codecs

req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
soup = BeautifulSoup(req.text,'lxml')

content = soup.find("div", class_="post-9")

f = open("first_lesson.html","w")
f.write(content.prettify().encode('utf-8'))
f.close()

and I got my first_lesson.html file in the folder. The problem is that when I open the HTML file to check the result, there are weird symbols here and there (try running the code and you'll see).

I added .encode('utf-8') because otherwise I would get the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 155: ordinal not in range(128)

How do I eliminate those weird symbols? What's the right encoding? And, in case I run into similar problems in the future, how can I know what the right encoding is?

UPDATE: instead of encoding in 'utf-8' I encoded in 'windows-1252' and it worked. But what is the best strategy for figuring out the proper encoding? Because I don't think try-this-try-that is a good one.

Answers


content.prettify() is a unicode string. Among other things, it contains the code point U+2014, which maps to the character — (EM DASH). The ASCII codec cannot encode it, because 0x2014 = 8212 is larger than 127.
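
A minimal sketch (assuming Python 2, like the code in the question) that reproduces the error and shows that utf-8 handles the same string fine:

s = u"an em dash \u2014 in a unicode string"

try:
    s.encode('ascii')               # fails: code point 0x2014 (8212) is above 127
except UnicodeEncodeError as e:
    print e

print repr(s.encode('utf-8'))       # works: utf-8 can represent any code point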

You can, however, encode your unicode string with any encoding that can handle those code points, for example utf-8, utf-16, utf-32, ucs-2 or ucs-4. There is no single "right" encoding, but utf-8 is the king of them, so it is usually a good choice when you want to encode a unicode string. Still, you could have chosen another one (that Python supports) and your program would also work, for example with

f.write(content.prettify().encode('utf-16'))
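
To make that concrete, here is a small sketch (Python 2) showing that different encodings give different bytes but decode back to the same unicode string:

text = u"caf\xe9 \u2014 na\xefve"

utf8_bytes = text.encode('utf-8')
utf16_bytes = text.encode('utf-16')

print utf8_bytes != utf16_bytes              # True: the byte sequences differ
print utf8_bytes.decode('utf-8') == text     # True: round-trips back to the same text
print utf16_bytes.decode('utf-16') == text   # True: so does utf-16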

prettify gives you a unicode string and by default tries the decoding with utf-8 (that's what I understand from having a look at the source), but you can give prettify an explicit encoding to work with as an argument. Think of a unicode string as an abstraction: a series of unicode code points, which basically corresponds to a series of characters (which are nothing but small images).
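
As a quick illustration (assuming the content tag from the question's code), prettify returns text by default and bytes when you hand it an encoding:

print type(content.prettify())            # <type 'unicode'> : text, not yet encoded
print type(content.prettify("utf-8"))     # <type 'str'>     : bytes, encoded as utf-8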

If you ever need to find the content type of an HTML document with BeautifulSoup, you may find this and this question useful.

Another point: in general, whenever you have plain bytes and nobody tells you how they are supposed to be decoded, you are out of luck and have to guess. If you know you are dealing with text, utf-8 is usually a good first guess, because a) it is widely used and b) the first 128 unicode code points correspond one-to-one with ASCII, and utf-8 encodes them with the same byte values.
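
A hedged sketch of that guessing strategy (guess_decode and the candidate list are just illustrations, and req.content stands for the raw bytes from the question's request; latin-1 accepts any byte sequence, so it acts as the catch-all and must come last):

def guess_decode(raw_bytes, candidates=('utf-8', 'windows-1252', 'latin-1')):
    # Return the decoded text plus the first candidate encoding that decodes cleanly.
    for enc in candidates:
        try:
            return raw_bytes.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings worked")

text, used_encoding = guess_decode(req.content)
print used_encoding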

You may also find this character table and this talk from PyCon 2012 useful.


Using requests with Python 2, pass req.content (the raw bytes) so BeautifulSoup can work out the encoding itself, and use io.open to write unicode to the file:

import requests
from bs4 import BeautifulSoup
import io

req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
# Raw bytes in, so BeautifulSoup/lxml detect the encoding themselves.
soup = BeautifulSoup(req.content, 'lxml')
content = soup.find("div", class_="post-9")

# io.open expects unicode in text mode; give the output file an explicit encoding.
with io.open("first_lesson.html", "w", encoding="utf-8") as f:
    f.write(content.prettify())

If you do want to specify the encoding, prettify takes an encoding argument, soup.prettify(encoding=...). There is also the response's encoding attribute:

enc = req.encoding

You can also try parsing the header with cgi.parse_header:

import cgi

enc = cgi.parse_header(req.headers.get('content-type', ""))[1]["charset"]
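
Note that indexing ["charset"] raises a KeyError when the server sends no charset parameter; a slightly safer variant of the same idea (continuing the snippet above) could be:

params = cgi.parse_header(req.headers.get('content-type', ""))[1]
enc = params.get("charset", "utf-8")   # fall back to utf-8 when no charset is sent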

Or try installing and using the chardet module:

import chardet

enc = chardet.detect(req.content)["encoding"]
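
For reference, chardet.detect returns a dict with the guessed encoding and a confidence score; a small sketch of what using it might look like (the exact values depend on the page):

result = chardet.detect(req.content)
print result                                   # e.g. {'confidence': 0.99, 'encoding': 'utf-8'}
text = req.content.decode(result['encoding'] or 'utf-8')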

You should also be aware that many encodings will run without error but leave you with garbage in the file. Here the charset is set to utf-8: you can see it in the headers returned, and if you look at the page source you can see <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.
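
A hedged sketch that reads the charset from both places (assuming the req and soup objects from the code above; the meta lookup targets the old-style http-equiv tag this page uses):

import cgi

# charset from the HTTP response header, if any
header_enc = cgi.parse_header(req.headers.get('content-type', ""))[1].get("charset")

# charset from the <meta http-equiv="Content-Type"> tag in the document, if any
meta = soup.find("meta", attrs={"http-equiv": "Content-Type"})
meta_enc = None
if meta is not None:
    meta_enc = cgi.parse_header(meta.get("content", ""))[1].get("charset")

print header_enc, meta_enc   # both should report utf-8 for this page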

