Python: How to preserve Ä,Ö,Ü when writing to file
I open 2 files in Python, change and replace some of their content and write the new output into a 3rd file. My 2 input files are XMLs, encoded in 'UTF-8 without BOM' and they have German Ä,Ö,Ü and ß in them. When I open my output XML file in Notepad++, the encoding is not specified (i.e. there's no encoding checked in the 'Encoding' tab). My Ä,Ö,Ü and ß are transformed into something like
When I create the output in Python, I use
with open('file', 'w') as fout: fout.write(etree.tostring(tree.getroot()).decode('utf-8'))
What do I have to do instead?
Some explanation for the xml.etree.ElementTree for Python 2, and for its function parse(). The function takes the source as the first argument. Or it can be an open file object, or it can be a filename. The function creates the ElementTree instance, and then it passes the argument to the tree.parse(...) that looks like this:
def parse(self, source, parser=None): if not hasattr(source, "read"): source = open(source, "rb") if not parser: parser = XMLParser(target=TreeBuilder()) while 1: data = source.read(65536) if not data: break parser.feed(data) self._root = parser.close() return self._root
You can guess from the third line that if the filename was passed, the file is opened in binary mode. This way, if the file content was in UTF-8, you are processing elements with UTF-8 encoded binary content. If this is the case, you should open also the output file in binary mode.
Another possibility is to use codecs.open(filename, encoding='utf-8') for opening the input file, and passing the open file object to the xml.etree.ElementTree.parse(...). This way, the ElementTree instance will work with Unicode strings, and you should encode the result to UTF-8 when writing the content back. If this is the case, you can use codecs.open(...) with UTF-8 also for writing. You can pass the opened output file object to the mentioned tree.write(f), or you let the tree.write(filename, encoding='utf-8') open the file for you.
I think this should work:
import codecs with codecs.open("file.xml", 'w', "utf-8") as fout: # do stuff with filepointer
To write an ElementTree object tree to a file named 'file' using the 'utf-8' character encoding:
When writing raw bytestrings, you want to open the file in binary mode:
with open('file', 'wb') as fout: fout.write(xyz)
Otherwise the open call opens the file in text mode and expects unicode strings instead, and will encode them for you.
To decode, is to interpret an encoding (like utf-8) and the output is unicode text. If you do want to decode first, specify an encoding when opening the file in text mode:
with open(file, 'w', encoding='utf-8') as fout: fout.write(xyz.decode('utf-8'))
If you don't specify an encoding Python will use a default, which usually is a Bad Thing. Note that since you are already have UTF-8 encoded byte strings to start with, this is actually useless.
Note that python file operations never transform existing unicode points to XML character entities (such as ü), other code you have could do this but you didn't share that with us.
I found Joel Spolsky's article on Unicode invaluable when it comes to understanding encodings and unicode.