Unexpected ElementTree behavior

Aug 30, 2006

I've been using the Python ElementTree library for parsing web service responses for my worldinpictures.org [1] site and generally found it reliable and easy to use.

Character encoding issues have caused me a number of problems recently, and I've now come across another one with ElementTree:

>>> from elementtree import ElementTree as ET
>>> ET.XML('<?xml version="1.0" encoding="utf-8" ?><title>Good morning Mazatl\xc3\xa1n!</title>').text
u'Good morning Mazatl\xe1n!'

>>> ET.XML('<?xml version="1.0" encoding="utf-8" ?><title>Good morning Mazatln!</title>').text
'Good morning Mazatln!'

It seems that if the element's text contains any non-ASCII characters, the result is a unicode string; otherwise it is a plain (byte) string.

It would be preferable to have a consistent return type (e.g. always unicode or always in the input encoding).

So, in my case, I pass the result through unicode() to ensure I always get a unicode result.

(There's an issue here with the unicode function and its reliance on the default encoding but that belongs in another post...)
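
As a rough illustration of that normalization step, here is a minimal sketch. The helpers to_text and title_text are hypothetical names, not part of ElementTree. It assumes that any plain string ElementTree returns holds only ASCII, which matches the behavior observed above, so it decodes with an explicit "ascii" codec instead of relying on the default encoding. The import fallback and the text_type alias are only there so the same code also runs on newer Pythons, where all text is already unicode.

```python
# Sketch only: to_text and title_text are hypothetical helpers, not ElementTree API.
try:
    from elementtree import ElementTree as ET  # standalone package, as used above
except ImportError:
    from xml.etree import ElementTree as ET  # standard-library copy

try:
    text_type = unicode  # Python 2
except NameError:
    text_type = str  # Python 3, where text is always unicode


def to_text(value, encoding="ascii"):
    """Return value as a unicode string, decoding byte strings explicitly."""
    if value is None or isinstance(value, text_type):
        return value
    # Assumption: a byte string coming back from ElementTree is pure ASCII,
    # so an explicit ASCII decode is safe and ignores the default encoding.
    return value.decode(encoding)


def title_text(xml_bytes):
    """Parse a small XML document and return the root element's text as unicode."""
    return to_text(ET.XML(xml_bytes).text)
```

With this in place, title_text() should hand back a unicode string for both of the documents shown earlier, whether or not the title contains an accented character.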

  1. UPDATE: worldinpictures.org has been retired.