How do I know which encoding is right?
I've decided to learn C++ and I really like the site www.learncpp.com. Now I would like to make a pdf version of it and print it, so that I can read it on paper. First I built a URL collector for all the chapters on the site. It works fine.
Now I'm working on creating an html out of the first chapter. I wrote the following:
import requests
from bs4 import BeautifulSoup
import codecs

req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
soup = BeautifulSoup(req.text, 'lxml')
content = soup.find("div", class_="post-9")
f = open("first_lesson.html", "w")
f.write(content.prettify().encode('utf-8'))
f.close()
and I got my first_lesson.html file in the folder. The problem is that when I open the html file to check the result, there are weird symbols here and there (try running the code and see).
I added .encode('utf-8') because otherwise I would get the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 155: ordinal not in range(128)
How can I eliminate those weird symbols? What's the right encoding? And, in case I run into similar problems in the future, how can I know which encoding is right?
UPDATE: instead of encoding in 'utf-8' I encoded in 'windows-1252' and it worked. But what is the best strategy for figuring out the proper encoding? Because I don't think try-this-try-that is a good one.
content.prettify() is a unicode string. Among other things it contains the code point U+2014, which maps to the character — (EM DASH). The ASCII codec cannot encode it, because 8212 = 0x2014 is larger than 127.
You can, however, encode your unicode string with any encoding that can handle all unicode code points, for example utf-8, utf-16, or utf-32. There is no single "right" encoding, but utf-8 is by far the most widely used, so it is usually a good choice when you want to encode a unicode string; any other Unicode encoding that Python supports would make your program work just as well.
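A small illustration of both points (a sketch, runnable in Python 2 and 3; the sample string is my own):

```python
# -*- coding: utf-8 -*-
s = u"An em dash: \u2014"

# The ASCII codec cannot represent code points above 127:
try:
    s.encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii failed: %s" % exc)

# Any Unicode encoding can represent it, each producing different bytes,
# and each round-trips back to the identical string:
for enc in ("utf-8", "utf-16", "utf-32"):
    encoded = s.encode(enc)
    assert encoded.decode(enc) == s
```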
prettify gives you a unicode string and by default works with utf-8 (that's what I understand from having a look at the source), but you can give prettify an explicit encoding to work with as an argument. Think of unicode strings as an abstraction: a sequence of unicode code points, which basically corresponds to a sequence of characters (which, on screen, are nothing but small images).
Another point: In general, whenever you have plain bytes and nobody tells you how they are supposed to be decoded, you are out of luck and have to play whack-a-mole. If you know that you are dealing with text, utf-8 is usually a good first guess because it is a) widely used and b) the first 128 unicode characters correspond one-to-one with ASCII and utf-8 encodes them with the same byte values.
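To see why guessing is risky: the same bytes can decode without error under several codecs, but only one gives the intended text. A sketch using the EM DASH from the question:

```python
# -*- coding: utf-8 -*-
data = u"\u2014".encode("utf-8")  # the EM DASH as three UTF-8 bytes

print(repr(data.decode("utf-8")))        # the intended character
print(repr(data.decode("windows-1252"))) # no error, but mojibake
print(repr(data.decode("latin-1")))      # no error, different mojibake
```

All three decodes "succeed", which is exactly the whack-a-mole problem: a wrong codec rarely announces itself with an exception.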
Using requests in python2, you should pass .content (the raw bytes) and let BeautifulSoup take care of the decoding; you can use io.open to write the unicode result to the file:
import requests
from bs4 import BeautifulSoup
import io

req = requests.get("http://www.learncpp.com/cpp-tutorial/01-introduction-to-these-tutorials/")
soup = BeautifulSoup(req.content, 'lxml')
content = soup.find("div", class_="post-9")

# pass an explicit encoding so the result does not depend on your locale
with io.open("first_lesson.html", "w", encoding="utf-8") as f:
    f.write(content.prettify())
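The io.open part can be checked without the network; a minimal sketch (the temp-file path and sample string are just for the demo):

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

html = u"<p>caf\u00e9 \u2014 d\u00e9j\u00e0 vu</p>"
path = os.path.join(tempfile.mkdtemp(), "first_lesson.html")

# io.open accepts unicode directly; the encoding argument handles the bytes
with io.open(path, "w", encoding="utf-8") as f:
    f.write(html)

with io.open(path, "r", encoding="utf-8") as f:
    assert f.read() == html  # round-trips intact
```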
If you did want to specify the encoding, prettify takes an encoding argument, soup.prettify(encoding=...); the response also has an encoding attribute:
enc = req.encoding
You can also try parsing the header with cgi.parse_header:
import cgi

# parse_header returns a (value, params) pair; the charset is in params
enc = cgi.parse_header(req.headers.get('content-type', ""))[1]["charset"]
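Since cgi.parse_header returns a (value, params) pair, here is a network-free sketch on a literal header value like the one this site sends:

```python
import cgi

# split a Content-Type header value into the media type and its parameters
value, params = cgi.parse_header("text/html; charset=UTF-8")
print(value)              # the media type: text/html
print(params["charset"])  # the charset parameter: UTF-8
```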
Or try installing and using the chardet module, whose detect function returns a dict with the guessed encoding:

import chardet

enc = chardet.detect(req.content)["encoding"]
You should also be aware that many encodings will run without error but leave you with garbage in the file. Here the charset is set to utf-8: you can see it in the returned headers, and if you look at the page source you can see <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />.