Removing non-breaking spaces from strings using Python
I am having some trouble with a very basic string issue in Python (that I can't figure out). Basically, I am trying to do the following:
# read file into a string
myString = file.read()
# Attempt to remove non breaking spaces
myString = myString.replace("\u00A0"," ")
# however, when I print my string to output to console, I get:
Foo **<C2><A0>** Bar
I thought that the "\u00A0" was the escape code for unicode non breaking spaces, but apparently I am not doing this properly. Any ideas on what I am doing wrong?
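You don't have a unicode string, but a UTF-8 encoded byte string (which is what a plain str is in Python 2.x). In UTF-8 the non-breaking space is stored as the two bytes C2 A0, so replace those bytes instead: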
myString = myString.replace("\xc2\xa0", " ")
Better would be to switch to unicode -- see this article for ideas. Thus you could say
uniString = unicode(myString, "UTF-8")
uniString = uniString.replace(u"\u00A0", " ")
and it should also work (caveat: I don't have Python 2.x available right now), although you will need to translate it back to bytes (binary) when sending it to a file or printing it to a screen.
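For readers on Python 3, the same two approaches can be sketched as below. This is only an illustration: the names raw, fixed_bytes and fixed_text are made up for the example, and raw stands in for bytes read from a file opened in binary mode.

```python
# Python 3 sketch: bytes from a UTF-8 file versus the decoded text.
raw = "Foo\u00a0Bar".encode("utf-8")  # stands in for bytes read from a file

# Byte-level replacement: NBSP is the two bytes C2 A0 in UTF-8.
fixed_bytes = raw.replace(b"\xc2\xa0", b" ")

# Text-level replacement: decode first, then replace the code point.
text = raw.decode("utf-8")
fixed_text = text.replace("\u00a0", " ")

print(fixed_bytes)  # b'Foo Bar'
print(fixed_text)   # Foo Bar
```

Decoding first is usually the easier route, since the replace then works on characters rather than on encoding-specific byte sequences.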
No, u"\u00A0" is the escape code for non-breaking spaces. "\u00A0" is 6 characters that are not any sort of escape code. Read this.
I hesitate before adding another answer to an old question, but since Python3 counts a Unicode "non-break space" character as a whitespace character, and since strings are Unicode by default, you can get rid of non-break spaces in a string s using join and split, like this:
s = ' '.join(s.split())
This will, of course, also change any other white space (tabs, newlines, etc). And note that this is Python3 only.
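A small Python 3 sketch of that trade-off, using a made-up sample string: split/join normalizes every run of whitespace, while a targeted replace touches only the non-breaking space.

```python
# Sample text with an NBSP, a tab, a space and a newline (hypothetical input).
s = "Foo\u00a0Bar\t baz\nqux"

# split() with no arguments splits on any Unicode whitespace, NBSP included.
collapsed = " ".join(s.split())

# A targeted replace leaves the tab and newline alone.
only_nbsp = s.replace("\u00a0", " ")

print(repr(collapsed))  # 'Foo Bar baz qux'
print(repr(only_nbsp))  # 'Foo Bar\t baz\nqux'
```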
Please note that a simple myString.strip() will remove not only spaces, but also non-breaking-spaces from the beginning and end of myString. Not exactly what the OP asked for, but still very handy in many cases.
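A quick way to see this on Python 3 (the sample string is invented for the illustration): strip() trims non-breaking spaces at the ends but leaves the one in the middle alone.

```python
# NBSP at both ends and one between the words (hypothetical input).
padded = "\u00a0 Foo\u00a0Bar \u00a0"

trimmed = padded.strip()

print(repr(trimmed))  # 'Foo\xa0Bar'
```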
There is no indication in what you write that you're necessarily doing anything wrong: if the original string had a non-breaking space between 'Foo' and 'Bar', you now have a normal space there instead. This assumes that at some point you've decoded your input string (which I imagine is a bytestring, unless you're on Python 3 or the file was opened with the open function from the codecs module) into a Unicode string; otherwise you're unlikely to locate a unicode character in a non-unicode string of bytes for the purposes of the replace. But still, there are no clear indications of problems in what you write.
Can you clarify what's the input (print repr(myString) just before the replace) and what's the output (print repr(myString) again just after the replace) and why you think that's a problem? Without the repr, strings that are actually different might look the same, but repr helps there.
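To illustrate why repr helps, here is a short Python 3 sketch with a made-up string. Printed normally, the two values look identical in most terminals, but repr shows the non-breaking space as an escape.

```python
before = "Foo\u00a0Bar"                 # contains a non-breaking space
after = before.replace("\u00a0", " ")   # contains a normal space

print(before, after)   # typically indistinguishable on screen
print(repr(before))    # 'Foo\xa0Bar'
print(repr(after))     # 'Foo Bar'
```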
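You can also get rid of the character by forcing an ASCII encoding. Be aware that this drops every non-ASCII character, not just non-breaking spaces, and that it removes them rather than replacing them with a space.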
cleaned_string = myString.encode('ascii', 'ignore')
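A Python 3 sketch of what that call actually does, with an invented sample string. Note that on Python 3 the result is a bytes object, so a decode is needed to get text back.

```python
# Sample text with an NBSP and an accented letter (hypothetical input).
my_string = "Foo\u00a0Bar caf\u00e9"

# errors="ignore" silently discards anything that is not ASCII.
cleaned_bytes = my_string.encode("ascii", "ignore")
cleaned_text = cleaned_bytes.decode("ascii")

print(cleaned_bytes)  # b'FooBar caf'
print(cleaned_text)   # FooBar caf
```

The words end up joined and the accented letter is lost, so this only suits input where nothing but ASCII needs to survive.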