What is the proper way to use codecs' encoding in Python?

I have an HTML file encoded in utf-8. I want to output it to a text file, also encoded in utf-8. Here's the code I'm using:

import codecs
import re

IN = codecs.open("E2P3.html","r",encoding="utf-8")
codehtml = IN.read()
IN.close()

#codehtml = codehtml.decode("utf-8") 

texte = re.sub("<br>","\n",codehtml)

#texte = texte.encode("utf-8") 

OUT = codecs.open("E2P3.txt","w",encoding="utf-8")
OUT.write(texte)
OUT.close()


As you can see, I've tried using both 'decode' and 'codecs'. Neither of these works: my output text file is detected as Occidental (Windows-1252) and some entities become gibberish. What am I doing wrong here?


When opening a UTF-8 file with the codecs module, as you did, the contents of the file are automatically decoded into Unicode strings, so you must not try to decode them again.

The same is true when writing the file; if you write it using the codecs module, the Unicode string you're passing will automatically be encoded to whatever encoding you specified.
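As a rough illustration of that round trip, here is a minimal sketch. The sample text and the temporary file name are made up for the example and are not from the question; the point is only that you pass and receive Unicode strings, while the bytes on disk are UTF-8.

```python
import codecs
import os
import tempfile

# Hypothetical sample text with non-ASCII characters (not from the question).
sample = u"caf\u00e9<br>na\u00efve"

# Hypothetical file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.html")

# Writing: codecs encodes the Unicode string to UTF-8 bytes on the way out.
with codecs.open(path, "w", encoding="utf-8") as out:
    out.write(sample)

# Reading: codecs decodes the UTF-8 bytes back into a Unicode string.
with codecs.open(path, "r", encoding="utf-8") as inp:
    text = inp.read()

# Look at the raw bytes to confirm what was actually stored on disk.
with open(path, "rb") as raw:
    raw_bytes = raw.read()
```

No explicit .decode() or .encode() call appears anywhere; adding one on top of codecs.open is what leads to double decoding or double encoding.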

To make it explicit that you're dealing with Unicode strings, it might be a better idea to use Unicode literals, as in

texte = re.sub(u"<br>", u"\n",codehtml)

although it doesn't really matter in this case. Since you're not actually using a regular expression, this could also be written as

texte = codehtml.replace(u"<br>", u"\n")

If the application doesn't recognize the file as UTF-8, it might help to save it with a BOM (Byte Order Mark). This is generally discouraged, but if the application can't recognize a UTF-8 file otherwise, it's worth a try:

OUT = codecs.open("E2P3.txt","w",encoding="utf-8-sig")
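To check what "utf-8-sig" actually does, a small sketch like the following can help. The file name and text are hypothetical; it writes a file with that codec, inspects the first bytes, and reads the file back.

```python
import codecs
import os
import tempfile

# Hypothetical output file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "with_bom.txt")

# "utf-8-sig" prepends the UTF-8 BOM before the encoded text.
with codecs.open(path, "w", encoding="utf-8-sig") as out:
    out.write(u"caf\u00e9\n")

# Inspect the raw bytes: the file should begin with the BOM.
with open(path, "rb") as raw:
    data = raw.read()
has_bom = data.startswith(codecs.BOM_UTF8)

# Reading with "utf-8-sig" strips the BOM again, so the text is unchanged.
with codecs.open(path, "r", encoding="utf-8-sig") as inp:
    text = inp.read()
```

Reading such a file with plain "utf-8" instead would leave the BOM in the text as a leading u"\ufeff" character, which is one reason the BOM is usually avoided unless the consuming application needs it.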
