What is the proper way to use codecs' encoding in Python?
I have an HTML file encoded in utf-8. I want to output it to a text file, also encoded in utf-8. Here's the code I'm using:
import codecs
import re

IN = codecs.open("E2P3.html", "r", encoding="utf-8")
codehtml = IN.read()
#codehtml = codehtml.decode("utf-8")
texte = re.sub("<br>", "\n", codehtml)
#texte = texte.encode("utf-8")
OUT = codecs.open("E2P3.txt", "w", encoding="utf-8")
OUT.write(texte)
IN.close()
OUT.close()
As you can see, I've tried using both 'decode' and 'codecs'. Neither of these works: my output text file defaults to Occidental (Windows-1252) and some entities become gibberish. What am I doing wrong here?
When opening a UTF-8 file with the codecs module, as you did, the contents of the file are automatically decoded into Unicode strings, so you must not try to decode them again.
The same is true when writing the file; if you write it using the codecs module, the Unicode string you're passing will automatically be encoded to whatever encoding you specified.
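To illustrate the two points above, here is a minimal sketch of the round trip. The helper name `convert_br_to_newlines`, the temporary file names, and the sample text are made up for the demonstration; only `codecs.open` and the `<br>` replacement come from the question.

```python
import codecs
import os
import tempfile


def convert_br_to_newlines(src_path, dst_path):
    # Hypothetical helper, not part of the original code.
    # codecs.open decodes on read, so `html` is already a Unicode string.
    with codecs.open(src_path, "r", encoding="utf-8") as src:
        html = src.read()
    text = html.replace(u"<br>", u"\n")
    # codecs.open encodes on write, so no explicit .encode() call is needed.
    with codecs.open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(text)
    return text


workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "sample.html")
dst_path = os.path.join(workdir, "sample.txt")

# Build a small UTF-8 input file from raw bytes (sample data, an assumption).
with open(src_path, "wb") as f:
    f.write(u"caf\u00e9<br>na\u00efve".encode("utf-8"))

result = convert_br_to_newlines(src_path, dst_path)

# Read the output back as raw bytes to check what was really written.
with open(dst_path, "rb") as f:
    raw = f.read()
```

If the file on disk decodes cleanly as UTF-8 but still looks wrong in another program, the problem is more likely how that program guesses the encoding than how the file was written.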
To make it explicit that you're dealing with Unicode strings, it might be a better idea to use Unicode literals, as in
texte = re.sub(u"<br>", u"\n",codehtml)
although it doesn't really matter in this case (which could also be written as
texte = codehtml.replace(u"<br>", u"\n")
since you're not actually using a regular expression).
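As a quick check that the two spellings agree when the pattern contains no regex metacharacters, here is a small sketch; the sample string is invented for the example.

```python
import re

# Sample input, an assumption made for the demonstration.
codehtml = u"ligne un<br>ligne deux<br>fin"

# "<br>" has no special regex characters, so both calls do the same work.
via_regex = re.sub(u"<br>", u"\n", codehtml)
via_replace = codehtml.replace(u"<br>", u"\n")
```

The two would diverge only if the pattern contained characters such as `.` or `*`, which `re.sub` interprets and `replace` treats literally.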
If the application doesn't recognize the UTF-8 file, it might help to save it with a BOM (Byte Order Mark). This is generally discouraged, but if the application can't recognize a UTF-8 file otherwise, it's worth a try:
OUT = codecs.open("E2P3.txt","w",encoding="utf-8-sig")
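To see what the "utf-8-sig" codec actually does, the following sketch writes a small file and inspects its first bytes. The file name and sample text are placeholders chosen for the example.

```python
import codecs
import os
import tempfile

# Placeholder path in a temporary directory (an assumption for the demo).
path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")

# Writing through "utf-8-sig" puts the three BOM bytes at the start of the file.
with codecs.open(path, "w", encoding="utf-8-sig") as out:
    out.write(u"caf\u00e9\n")

with open(path, "rb") as f:
    raw = f.read()
has_bom = raw.startswith(codecs.BOM_UTF8)

# Reading through "utf-8-sig" strips the BOM again, so the text is unchanged.
with codecs.open(path, "r", encoding="utf-8-sig") as f:
    text = f.read()
```

Reading the same file with plain "utf-8" would leave a leading `u"\ufeff"` character in the string, which is one reason the BOM is usually avoided unless the consuming application needs it.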