Python unicode function weirdness

Aug 30, 2006

Python has a built-in function called unicode which converts byte strings to unicode objects.
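A minimal session (byte string in, unicode object out):

>>> unicode("abc")
u'abc'
>>> type(unicode("abc"))
<type 'unicode'>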

When called with only one argument (the string to convert), it assumes the string is encoded in the default encoding. This is normally ASCII, but it can be overridden in site.py.
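You can check the default with sys.getdefaultencoding(); under the usual ASCII default, non-ASCII bytes fail to decode:

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> unicode("caf\xe9")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)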

I normally want to write code that works regardless of the default encoding. Thankfully, unicode takes an optional second argument that lets you specify an encoding rather than relying on the default. Unfortunately, and for no reason I can think of, supplying this argument causes the function to behave differently when given a unicode string as input:

>>> unicode(u"abc")
u'abc'

>>> unicode(u"abc", "ascii")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
TypeError: decoding Unicode is not supported

This strikes me as rather bizarre behavior. Surely unicode(s) ought to behave exactly the same as unicode(s, sys.getdefaultencoding())?
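Sure enough, spelling out the default encoding explicitly breaks on unicode input:

>>> import sys
>>> unicode(u"abc", sys.getdefaultencoding())
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: decoding Unicode is not supported

In the meantime, a small helper (my own sketch, not part of the standard library) sidesteps the problem by only decoding byte strings and passing unicode input through untouched:

>>> def to_unicode(s, encoding="ascii"):
...     if isinstance(s, unicode):
...         return s  # already unicode; nothing to decode
...     return unicode(s, encoding)
...
>>> to_unicode(u"abc", "ascii")
u'abc'
>>> to_unicode("abc", "ascii")
u'abc'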