Unexpected ElementTree behavior

Aug 30, 2006

I've been using the Python ElementTree library to parse web service responses for my worldinpictures.org¹ site, and have generally found it reliable and easy to use.

Character encoding issues have caused me a number of problems recently and I've come across another one with ElementTree:

>>> from elementtree import ElementTree as ET
>>> ET.XML('<?xml version="1.0" encoding="utf-8" ?><title>Good morning Mazatl\xc3\xa1n!</title>').text
u'Good morning Mazatl\xe1n!'

>>> ET.XML('<?xml version="1.0" encoding="utf-8" ?><title>Good morning Mazatln!</title>').text
'Good morning Mazatln!'

It seems that if the element's text contains any non-ASCII characters, the result is a unicode string; otherwise it is a plain (byte) string.

It would be preferable to have a consistent return type (e.g. always unicode or always in the input encoding).

So, in my case, I pass the result through unicode() to ensure I always get a unicode result.
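As a rough sketch of that workaround, something like the following could wrap the parsing step. The helper name element_text is hypothetical, and the import assumes the standard-library xml.etree.ElementTree module rather than the standalone elementtree package used above. Instead of a bare unicode() call it decodes explicitly as ASCII, on the assumption (based on the behaviour observed above) that a plain string result only ever contains ASCII. On Python 3 the text is always a str, so the wrapper changes nothing there.

```python
# Assumption: the stdlib module stands in for the standalone elementtree package.
import xml.etree.ElementTree as ET

try:
    text_type = unicode  # Python 2: the unicode type
except NameError:
    text_type = str      # Python 3: str is already unicode text


def element_text(xml_bytes):
    """Return the root element's text, always as a unicode string.

    Hypothetical helper, not part of ElementTree itself.
    """
    text = ET.XML(xml_bytes).text
    if text is None:
        # An empty element has no text; normalise to an empty unicode string.
        return text_type()
    if not isinstance(text, text_type):
        # Python 2 only: assumed to be pure ASCII, so decode explicitly
        # rather than relying on the interpreter's default encoding.
        return text.decode("ascii")
    return text


accented = element_text(
    b'<?xml version="1.0" encoding="utf-8" ?>'
    b'<title>Good morning Mazatl\xc3\xa1n!</title>'
)
plain = element_text(
    b'<?xml version="1.0" encoding="utf-8" ?>'
    b'<title>Good morning Mazatlan!</title>'
)
```

Both accented and plain should then be the same type regardless of their content, which is the consistency I was after.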

(There's an issue here with the unicode function and its reliance on the default encoding, but that belongs in another post...)

¹ UPDATE: worldinpictures.org has been retired.

python

worldinpictures

unicode