UnicodeForText
Unicode for text proposal
Status
Author
Problem
Zope 3 will support internationalization (i18n). To do this, it needs a way of representing text characters that can't be represented as single-byte characters.
Text is defined as data that are used for human discourse or applications of same (e.g. cataloging). Things that are text include document content, user interface prompts, content names, ids, and so on. Things that are not text include internal identifiers, binary data, etc.
Alternative approaches were discussed in the UnicodeOrUTF8 discussion. There were three alternatives:
- Unicode (U),
- UTF-8 (S), or
- Mixed
UTF-8 is not a good choice for string operations like splitting, searching, cataloging, etc.
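A small sketch of why byte-level UTF-8 is awkward for operations such as slicing. It is written for Python 3 (an assumption; this proposal predates it), where `str` corresponds to what the proposal calls unicode and `bytes` to an 8-bit string. The sample text is made up for illustration.

```python
# Hypothetical sample: a pound sign followed by a digit.
text = "\u00a35"                 # two characters
raw = text.encode("utf-8")       # three bytes: the pound sign takes two

# Length and indexing count bytes, not characters, on the encoded form.
char_count = len(text)
byte_count = len(raw)

# Slicing the bytes can cut a character in half, leaving invalid UTF-8.
first_byte = raw[:1]
try:
    first_byte.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

# The same slice on the text form yields a whole character.
first_char = text[:1]
```

Here the encoded form is one unit longer than the text, and a one-byte slice is no longer decodable, whereas the one-character slice of the text is the pound sign itself.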
A mixed model would be attractive if there were automatic conversion from UTF-8 strings to unicode. This is only possible if Python's global default encoding is changed. Changing Python's default encoding requires changes to a site's installation or to the unicode modules, which is unattractive. In addition, assuming that strings contain UTF-8 is probably not safe. The default encoding assumes ASCII strings and raises an error when non-ASCII strings are given.
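The two failure modes described above can be sketched as follows. This is only an illustration in Python 3 terms (an assumption), with explicit `decode` calls standing in for the implicit default conversion; the helper name `try_decode` and the sample byte strings are hypothetical.

```python
def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are not valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

pound_latin1 = b"\xa3"       # pound sign encoded as Latin-1
pound_utf8 = b"\xc2\xa3"     # pound sign encoded as UTF-8

# An ASCII default rejects any non-ASCII byte string, whatever its real encoding.
ascii_result = try_decode(pound_utf8, "ascii")

# Assuming UTF-8 is not safe either: this Latin-1 byte is not valid UTF-8.
wrong_guess = try_decode(pound_latin1, "utf-8")

# Decoding succeeds only when the actual encoding is known and stated.
right_guess = try_decode(pound_latin1, "latin-1")
```

Both the ASCII attempt and the wrong UTF-8 guess fail, which is why the proposal prefers an explicit decode step at the point where the encoding is known.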
Proposal
For Zope 3, we will standardize internally on unicode for textual data. Facilities for manipulating and storing text will do so as unicode objects. There are two special cases:
- Input and output facilities will need to decode and encode data. This will depend on environmental data like request headers. For example, data may be provided by a browser in UTF-8 and need to be converted to unicode before it is passed to objects for storage or manipulation.
- Facilities may accept either unicode or strings, converting
strings to unicode using the default (ascii) encoding.
This will typically be accomplished by converting all text arguments with the unicode builtin function, as in:
    def foo(self, id, title, description):
        title = unicode(title)
        description = unicode(description)
        ....
Unicode variants and storage
Python can be configured to use 2-byte unicode or 4-byte unicode. The default configuration is 2-byte unicode, which is adequate for most character sets. For most European languages, 2-byte unicode is a little less efficient in memory usage than UTF-8; however, 2-byte unicode is more memory efficient for many non-European languages.
Unicode data are serialized as UTF-8 in pickles and, thus, in the Zope Object Database (ZODB). A site that needs 4-byte unicode can use it and still read data written by a 2-byte unicode site; however, a site using 2-byte unicode may not be able to read data written by a site using 4-byte unicode.
We will leave the selection of 2-byte or 4-byte unicode up to site managers, but expect most people to use the default 2-byte unicode.
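A rough sketch of the size trade-off mentioned above, comparing the encoded length of two made-up sample strings. It uses Python 3 (an assumption) and treats UTF-16 as a stand-in for 2-byte unicode; actual in-memory sizes depend on the Python build and are not measured here. The helper name `encoded_sizes` is hypothetical.

```python
# Hypothetical samples: a German word and a short Japanese word.
european = "Gr\u00fc\u00dfe"           # 5 characters, two of them non-ASCII
japanese = "\u65e5\u672c\u8a9e"        # 3 characters, all outside Latin ranges

def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a byte-order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

european_sizes = encoded_sizes(european)
japanese_sizes = encoded_sizes(japanese)
```

For these samples UTF-8 is smaller for the German word (7 bytes versus 10) and larger for the Japanese one (9 bytes versus 6), which matches the direction of the claim, though two short strings are not a measurement.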
- gvanrossum (May 17, 2002 4:15 pm; Comment #1)
- I'm generally +1 on this. One question:
> - Facilities may accept either unicode or strings, converting
>   strings to unicode using the default (ascii) encoding.
>
>   This will typically be accomplished by converting all text
>   arguments with the unicode builtin function, as in::
>
>     def foo(self, id, title, description):
>
>         title = unicode(title)
>         description = unicode(description)
>
>         ....
What is the reason for aggressively converting to Unicode? If s is an ASCII string, and u is a Unicode string, combining s and u in an expression (e.g. s+u or u.split(s)) automatically applies the default conversion to unicode on s, just as if unicode(s) was written. So we might as well leave ASCII strings alone. This doesn't do anything for Latin-1 strings, except perhaps move the error to an earlier point -- but the fact that you propose to use unicode() suggests that we don't really expect Latin-1 strings at all.
- faassen (May 17, 2002 5:52 pm; Comment #2)
- I'm +1 on this. Just today I spent a whole day trying to work out the
intricacies of encodings in browsers, XML, Zope and Python, and their
interactions. Life would've been a lot easier if the Zope core (including Page Templates, there's an errant str() and probably more in there, in Zope 2.5 and also still in Zope 2.6 CVS) could work properly with unicode, and a lot easier still if all strings in Zope could be depended on to be in unicode.
Then, for output, do encoding over whole web pages after they are generated, so that is easy. Transforming input to unicode will be more tricky, but still a lot simpler than not doing it. :)
If indeed we can get rid of latin-1 strings reliably and turn them into unicode early on in any input (or code that manipulates, say, a DOM), leaving pure ASCII strings alone is fine with me, though I can see a bit of a purity/simplicity argument there.
But if allowing simple ASCII strings leads to latin1 sneaking into the system, then I'm all for catching/translating that as early as possible. The confusion coming from this is pretty bad; the unicode part was, until very recently, one of the darkest parts of Python to me and I suspect there are a lot of developers out there who are even less clued in than I am. Just this evening I helped Stephan Richter with this, for instance, and he co-wrote one of the Zope internationalization tools!
So absolute utter simplicity is essential here if we don't want to confuse innocent developers with strange unicode errors deep inside Zope.
I'm very glad I saw this issue come up. Again, Zope 3 looks to make my life a lot easier than it is with Zope 2. I could've done something else this afternoon instead. :)
- chrisw (May 18, 2002 4:36 am; Comment #3)
- Some questions:
- How come content names can't be unicode? That might be limiting for a lot of users...
- What will happen with latin1 stuff? I mean, as soon as I put a £ into python's unicode() function, it bleats about ordinals out of range. The fact that I can only change this default by actually hacking Python itself seems wrong to me :-(
- jim (May 18, 2002 1:39 pm; Comment #4)
> gvanrossum (May 17, 2002 4:15 pm; Comment #1) --
...
> What is the reason for aggressively converting to Unicode?
> If s is an ASCII string, and u is a Unicode string, combining s and u
> in an expression (e.g. s+u or u.split(s)) automatically applies
> the default conversion to unicode on s, just as if unicode(s) was
> written. So we might as well leave ASCII strings alone.
> This doesn't do anything for Latin-1 strings, except perhaps move
> the error to an earlier point -- but the fact that you propose to
> use unicode() suggests that we don't really expect Latin-1 strings
> at all.
Moving the error to an earlier point is the point. A method that takes a string may just save it for later processing. If the string is Latin-1, it may be saved and an error may occur much later, when the cause is harder to diagnose.
If a routine is going to do a computation that is guaranteed to promote the argument to unicode, then the explicit conversion could be avoided, although I think this might be a bit brittle.
- jim (May 18, 2002 1:40 pm; Comment #5)
> chrisw (May 18, 2002 4:36 am; Comment #3) --
> Some questions:
>
> 1. How come content names can't be unicode? That might be limiting for a
>    lot of users...
This is a big enough topic that I've done something wild and crazy and moved the discussion to the mailing list by answering there. :)
> 2. What will happen with latin1 stuff? I mean, as soon as I put a £ into
>    python's unicode() function, it bleats about ordinals out of range. The
>    fact that I can only change this default by actually hacking Python
>    itself seems wrong to me :-(
Just supply the encoding to the unicode function:
s = unicode("\xa3", "latin-1")
or use a unicode string:
s = u'\xa3'
