Python convert binary file into string while ignoring non-ascii characters

I have a binary file and I want to extract all ascii characters while ignoring non-ascii ones. Currently I have:

with open(filename, 'rb') as fobj:
   text = fobj.read().decode('utf-16-le')
   file = open("text.txt", "w")
   file.write("{}".format(text))
   file.close()

However, I get an error when writing to the file: UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128). How do I get Python to ignore non-ASCII characters?

Answers


Use the built-in ASCII codec and tell it to ignore any errors, like:

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')

# Encode to ASCII bytes, silently dropping anything the codec can't represent
with open("text.txt", "wb") as out_file:
    out_file.write(text.encode('ascii', 'ignore'))

You can test & play around with this in the Python interpreter:

>>> s = u'hello \u00a0 there'
>>> s
u'hello \xa0 there'

In Python 2, just trying to convert it to a byte string with str() raises an exception:

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...as does just trying to encode that unicode string to ASCII:

>>> s.encode('ascii')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 6: ordinal not in range(128)

...but telling the codec to ignore the characters it can't handle works okay:

>>> s.encode('ascii', 'ignore')
'hello  there'
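
The same idea can also be pushed into the file object itself. The following is only a sketch, assuming a Python version where io.open is available (2.6+ or 3.x) and assuming the input really is valid UTF-16-LE; the function name extract_ascii and its parameters are made up for illustration:

```python
import io


def extract_ascii(src_path, dst_path, src_encoding="utf-16-le"):
    """Copy the text in src_path to dst_path, dropping non-ASCII characters.

    Hypothetical helper: the name and parameters are illustrative only.
    """
    # Decode the source file while reading it.
    with io.open(src_path, "r", encoding=src_encoding) as src:
        text = src.read()

    # errors="ignore" tells the ASCII codec to skip characters it cannot encode
    # instead of raising UnicodeEncodeError.
    with io.open(dst_path, "w", encoding="ascii", errors="ignore") as dst:
        dst.write(text)
```

Called as extract_ascii(filename, "text.txt"), this should leave only the ASCII characters in text.txt. If the input is not valid UTF-16, the read step raises UnicodeDecodeError; passing errors="ignore" to the first io.open call as well is one way to tolerate that.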

Basically, the ASCII table maps the values in the range [0, 2^7) = [0, 128) to characters (printable or not). So, to ignore non-ASCII characters, you just have to skip every character whose code is not in that range, i.e. whose code is greater than 127.

Python has a built-in function called ord which, according to its docstring, will:

Return the integer ordinal of a one-character string.

In other words, it gives you the code of a character. Now, you must ignore all characters that, passed to ord, return a value of 128 or greater. This can be done by:

with open(filename, 'rb') as fobj:
    text = fobj.read().decode('utf-16-le')
    out_file = open("text.txt", "w")

    # Check every single character of `text`
    for character in text:
        # If it's an ascii character
        if ord(character) < 128:
            out_file.write(character)

    out_file.close()
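
If the text fits in memory, the same filter can be applied in one pass before writing, rather than one write call per character. A minimal sketch (keep_ascii is a hypothetical helper name, not something from the standard library):

```python
def keep_ascii(text):
    """Return text without any character whose code is 128 or above."""
    # ord(ch) < 128 is the same test as in the loop above.
    return u"".join(ch for ch in text if ord(ch) < 128)


print(keep_ascii(u"hello \u00a0 there"))  # prints: hello  there
```

The result can then be written to the output file with a single write call.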

Now, if you only want to keep printable characters, note that all of them (in the ASCII table at least) lie between 32 (space) and 126 (tilde), so you simply need to test:

if 32 <= ord(character) <= 126:
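
As a rough illustration of that range check, here is a small sketch (keep_printable is a made-up name). Note that the 32 to 126 range also drops tabs and newlines, since their codes are 9 and 10, so this assumes you really want a single line of visible characters and spaces:

```python
def keep_printable(text):
    """Return only the characters of text with codes from 32 to 126."""
    # 32 is the space character and 126 is the tilde; control characters
    # such as tab (9), newline (10) and DEL (127) fall outside the range.
    return u"".join(ch for ch in text if 32 <= ord(ch) <= 126)


print(keep_printable(u"hello\n\u00a0 there~\x7f"))  # prints: hello there~
```

If line breaks should survive, the condition could be extended with something like `or ch in u"\n\t"`.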
