python - French and lxml text

Question

I'm trying to assign a valid French string to an element's text using lxml:

el = etree.Element("someelement")
el.text = 'Disponible à partir du 1er Octobre'

I get the error:

ValueError: All strings must be XML compatible: Unicode or ASCII, no NULL bytes or control characters

I've also tried:

el.text = etree.CDATA('Disponible à partir du 1er Octobre')

However I get the same error.

How do I handle French in XML, in particular, ISO-8859-1? There are ways to specify encoding within the tostring() function in lxml, but not for assigning text values within elements.

score 5 · Accepted Answer

If you are using Python 2, try a Unicode literal: el.text = u'Disponible à partir du 1er Octobre'

Answered 2013-06-16T21:54:54.180

score 5 · Accepted Answer

If text contains non-ascii data then you should provide it as a Unicode string for el.text.

As @Abbasov Alexander's answer shows, you can do this with a Unicode literal, u''. Python didn't raise an exception for the non-ascii character in your source, so I assume you've declared a character encoding for your Python source file (e.g., using a # coding: utf-8 comment at the top). This encoding defines how Python interprets non-ascii characters in the source; it is unrelated to the encoding you use to save xml to a file.

If the text is already in a variable and you haven't converted it to Unicode yet, you could do it using text.decode(text_encoding) (text_encoding may be unrelated to the Python source encoding).
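A minimal sketch of that decoding step, assuming the bytes are ISO-8859-1 as in the question. The sample bytes and the text_encoding value are illustrative assumptions, and the fallback import of the standard library's ElementTree is only there so the snippet runs where lxml is not installed:

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

# Bytes as they might arrive from an ISO-8859-1 file or socket (assumed input).
raw = b'Disponible \xe0 partir du 1er Octobre'
text_encoding = 'iso-8859-1'  # assumption: the real encoding of your data

# Decode once, then hand the Unicode string to the element.
text = raw.decode(text_encoding)

el = etree.Element("someelement")
el.text = text
```

If text_encoding does not match the real encoding of the data, decode() either raises UnicodeDecodeError or silently produces the wrong characters, so it is worth confirming where the bytes came from.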

The confusing bit might be that el.text (as an optimization) returns a bytestring on Python 2 for pure ascii data. That breaks the rule that you should not mix bytes and Unicode strings, though it should still work if sys.getdefaultencoding() returns an ascii-based encoding, as it does in most cases.

To save xml, pass whatever character encoding you need to the tostring() or ElementTree.write() functions. Again, this encoding is unrelated to the other encodings already mentioned.
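As a rough illustration, the same element can be serialized to different encodings without touching its text. The variable names are made up for this sketch, and the import falls back to the standard library's ElementTree if lxml is missing:

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree

el = etree.Element("someelement")
el.text = u'Disponible \xe0 partir du 1er Octobre'  # \xe0 is the letter a with a grave accent

# The output encoding is chosen only at serialization time.
latin1_bytes = etree.tostring(el, encoding='iso-8859-1')
utf8_bytes = etree.tostring(el, encoding='utf-8')
```

Here latin1_bytes stores the accented letter as the single byte 0xE0, while utf8_bytes stores it as the two bytes 0xC3 0xA0; el.text is the same Unicode string in both cases.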

In general, use the Unicode sandwich: decode bytes to Unicode as soon as you receive them, work with Unicode text inside your program, and encode to bytes as late as possible, when you need to send the text through an API that doesn't support Unicode (files, network).
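One way to picture the sandwich, as a sketch only: the helper lines_to_xml, the element names "items" and "item", and the in-memory streams standing in for real files are all hypothetical, and the import falls back to the standard library's ElementTree if lxml is missing.

```python
import io

try:
    from lxml import etree
except ImportError:  # fallback so the sketch still runs without lxml
    import xml.etree.ElementTree as etree


def lines_to_xml(byte_lines, in_encoding, out_encoding):
    """Hypothetical helper: one <item> per input line."""
    root = etree.Element("items")
    for raw in byte_lines:
        # Bottom slice: decode bytes as soon as they enter the program.
        text = raw.decode(in_encoding).strip()
        # Filling: everything in between works on Unicode text only.
        item = etree.SubElement(root, "item")
        item.text = text
    # Top slice: encode as late as possible, when writing the output.
    out = io.BytesIO()
    etree.ElementTree(root).write(out, encoding=out_encoding)
    return out.getvalue()


# An in-memory stream stands in for an ISO-8859-1 input file (assumed data).
source = io.BytesIO(b'Disponible \xe0 partir du 1er Octobre\n')
xml_bytes = lines_to_xml(source, 'iso-8859-1', 'utf-8')
```

The input and output encodings are independent parameters here, which mirrors the point that the source encoding, the data encoding, and the saved XML encoding need not be the same.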
