Unicode error handling with Python 3's readlines()

I keep getting this error while reading a text file. Is it possible to handle/ignore it and proceed?

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7827: character maps to <undefined>

Answers


In Python 3, pass an appropriate errors= value (such as errors='ignore' or errors='replace') when creating your file object (presuming it is an instance of io.TextIOWrapper; if it isn't, consider wrapping it in one). Also consider passing a more likely encoding than charmap; when you aren't sure, utf-8 is a good place to start.

For instance:

f = open('misc-notes.txt', encoding='utf-8', errors='ignore')
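To see how the two errors= values differ with readlines(), here is a minimal sketch. The sample bytes and the temporary file are made up for illustration; the only assumption is that the file contains a stray 0x81 byte like the one in the question.

```python
import os
import tempfile

# Hypothetical sample data: a lone 0x81 byte is not valid UTF-8.
raw = b"first line\nbad \x81 byte\nlast line\n"

# Write the bytes to a temporary file so open() has something to read.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # errors='replace' substitutes U+FFFD for each undecodable byte.
    with open(path, encoding='utf-8', errors='replace') as f:
        replaced = f.readlines()

    # errors='ignore' silently drops each undecodable byte.
    with open(path, encoding='utf-8', errors='ignore') as f:
        ignored = f.readlines()
finally:
    os.remove(path)

print(repr(replaced[1]))
print(repr(ignored[1]))
```

With 'replace' the damaged spot stays visible in the output, which is usually easier to debug than 'ignore', where the byte vanishes without a trace.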

In Python 2, the read() operation simply returns bytes; the trick, then, is decoding them to get them into a string (if you do, in fact, want characters as opposed to bytes). If you don't have a better guess for their real encoding:

your_string.decode('utf-8', 'replace')

...to replace unhandled characters, or

your_string.decode('utf-8', 'ignore')

to simply ignore them.

That said, finding and using their real encoding (rather than guessing utf-8) would be preferred.
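If you have a short list of plausible encodings, one way to approximate "finding the real encoding" is to try each in strict mode and keep the first that decodes cleanly. This is only a sketch: decode_with_fallback and the candidate list are hypothetical, and a clean decode does not prove the guess is right (latin-1, for example, accepts every byte).

```python
def decode_with_fallback(data, encodings=('utf-8', 'cp1252', 'latin-1')):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    The candidate list is an assumption; put your likeliest encodings first.
    """
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this guess failed, try the next one
    raise ValueError('none of the candidate encodings could decode the data')


# Valid UTF-8 is accepted by the first candidate.
text_a, enc_a = decode_with_fallback(b'caf\xc3\xa9')

# 0xE9 alone is invalid UTF-8 but is a valid cp1252 character.
text_b, enc_b = decode_with_fallback(b'caf\xe9')

# 0x81 is undefined in Python's cp1252 codec, so latin-1 is the last resort.
text_c, enc_c = decode_with_fallback(b'\x81')

print(enc_a, enc_b, enc_c)
```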


You should open the file with the codecs module to make sure that the file gets interpreted as UTF-8.

import codecs
fd = codecs.open(filename, 'r', encoding='utf-8')
data = fd.read()

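You could also wrap the read in a try/except block. Note that a failed read raises UnicodeDecodeError, not UnicodeEncodeError:

try:
    data = fd.read()
except UnicodeDecodeError:
    data = None  # decoding failed; handle or skip as appropriate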
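A try/except around the whole read discards everything on the first bad byte. If the goal is to skip only the offending lines and proceed, one option is to read the file in binary mode and decode line by line. The helper read_decodable_lines below is hypothetical, and the sketch assumes an ASCII-compatible encoding such as UTF-8, where a newline is the single byte 0x0A.

```python
import io


def read_decodable_lines(binary_file, encoding='utf-8'):
    """Decode a binary file line by line, skipping lines that fail to decode.

    Returns (decoded_lines, skipped_count).
    """
    good = []
    skipped = 0
    for raw_line in binary_file:
        try:
            good.append(raw_line.decode(encoding))
        except UnicodeDecodeError:
            skipped += 1  # only this line is dropped, the loop continues
    return good, skipped


# io.BytesIO stands in for a file opened with open(path, 'rb').
sample = io.BytesIO(b"ok line\nbad \x81 line\nalso ok\n")
lines, skipped = read_decodable_lines(sample)

print(lines, skipped)
```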
