UnicodeOrUTF8

Unicode vs. UTF-8 as internal storage format for strings in Zope 3

Author

Andreas Jung

Status

This document is not a proposal but the summary of a discussion during the Zope Internationalization sprint on Jan 17 2002.

Participants: Stefan Richter, Jim Fulton, Juan David Ibanez Palomar, Andreas Jung

This wiki page intends to collect input from the community so that a final decision can be made based on the results of this discussion.

Discussion

We discussed what Unicode support in Zope 3 should look like and, in particular, how strings should be stored internally. We identified the following three approaches:

  • Unicode storage (U)

    All texts would be stored as Python unicode strings. Incoming and outgoing texts must be converted to/from unicode.

    Pros:

    • unified storage and manipulation of texts
    • most general solution

    Cons:

    • requires more space (every single unicode string allocates 56 bytes plus 2 or 4 times the length of the original string, depending on whether Python uses UCS-2 or UCS-4 as its internal unicode storage)
  • Texts encoded using UTF-8 (S)

    All texts would be stored as standard Python strings using UTF-8 encoding.

    Pros:

    • more efficient in terms of space usage and speed of string manipulation
    • (maybe) better support in 3rd party applications like editors, databases and browsers

    Cons:

    • string manipulations are more difficult (e.g. len() for a UTF-8 encoded string returns the number of bytes in the encoded string, not the number of characters)
  • Mixed unicode and UTF-8 encoded texts (M)

    API methods are expected to deal with both unicode and UTF-8 encoded texts. Each component would decide how to handle texts internally. A method would be required to return texts either UTF-8 encoded or as a unicode string.

    Pros:

    • most flexible

    Cons:

    • increased effort for type checking and conversion
    • more difficult string manipulation (see UTF-8)
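
The len() difference noted under (S) and the type checking that (M) would require can be sketched as follows. This is only an illustration, written for current Python 3, where str plays the role of the unicode string and bytes that of the plain (UTF-8 encoded) string. The helper name to_unicode is hypothetical and not part of any Zope API.

```python
def to_unicode(text, encoding="utf-8"):
    """Hypothetical helper for approach (M): accept either kind of text.

    bytes are assumed to be UTF-8 encoded and are decoded; str is
    returned unchanged. Anything else is rejected.
    """
    if isinstance(text, str):
        return text
    if isinstance(text, bytes):
        return text.decode(encoding)
    raise TypeError("expected str or bytes, got %s" % type(text).__name__)


word = "Z\u00fcrich"               # "Zürich": 6 characters, one non-ASCII
encoded = word.encode("utf-8")     # what approach (S) would store

char_count = len(word)             # counts characters
byte_count = len(encoded)          # counts bytes, not characters

print(char_count, byte_count)                    # prints: 6 7
print(to_unicode(encoded) == to_unicode(word))   # prints: True
```

Every API method under (M) would need a check of this kind on its text arguments, which is the extra effort listed in the cons above.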

Variations

Instead of UTF-8 it would also be possible to use UTF-16 or UTF-32 as the encoding. The pros and cons are similar.
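
As a rough illustration of how the choice of encoding affects space, the following sketch compares encoded sizes for an ASCII word and a short Japanese word. It is written for current Python 3. The function name encoded_sizes and the sample strings are made up for this example.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32."""
    # The "-le" codec variants are used so that no byte order mark
    # is prepended to the encoded data.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


ascii_sizes = encoded_sizes("Zope")                # 4 ASCII characters
cjk_sizes = encoded_sizes("\u65e5\u672c\u8a9e")    # 3 CJK characters

print(ascii_sizes)   # {'utf-8': 4, 'utf-16': 8, 'utf-32': 16}
print(cjk_sizes)     # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
```

Which encoding is the most compact therefore depends on the text being stored.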

All three alternatives are marked with letters U, S and M. You can use these letters to refer to the corresponding approach.

Feel free to add comments or share your thoughts. No final decision has been made; the decision will depend on the results of this discussion.


stevea (Jan 22, 2002 9:08 am; Comment #2)
Here's a far-out idea for the M approach: Use a hint in a method's docstring to indicate whether it wants arguments in Unicode or as UTF-8. Now that I've said it, I hope no-one takes it seriously :-)
snej (Jan 22, 2002 9:13 am; Comment #3)
U: UCS-2/UCS-4 strings do not necessarily require more space than UTF-8. This depends on which part of Unicode is used most frequently.

U: If space really is a problem, maybe an alternative Unicode String implementation optimized for Western Countries using UTF-8 internally is a possible solution.

M: Requests can be encoded in a lot of different charsets. When will they be converted to Unicode/UTF-8 for the API?

M: What should the API do with Non-UTF-8 strings?

S: It should be easy to convert output to UTF-8 (or any other charset) if 3rd party apps require it.

EIONET (Jan 22, 2002 9:35 am; Comment #4)
It seems to me that the output to web browsers must be in UTF-8. Therefore, if the strings are stored in Unicode, Zope must always convert. The same problem arises when data is submitted to Zope through web forms or received through e.g. XML-RPC.
htrd (Jan 23, 2002 4:53 am; Comment #5)
This description mixes up two distinct concepts: APIs and storage.

I can't see any good reason that APIs should use anything other than Unicode objects... that's what they are designed for. If some optimisation possibility makes any other option seem appealing then we should consider applying that optimisation to Python's standard unicode object - not Zope.

Document objects may choose an internal representation that uses a different encoding, if that saves space. They might gzip the data too.

bth (Jan 23, 2002 10:27 am; Comment #6)
One vote for Unicode as an API standard - mine is an Asian language environment (Japanese), UTF-8 is not in general use in browsers here or anywhere else in Asia. Internal storage is a different issue but probably in most cases uncompressed is OK - disk and memory is cheap these days...
chrisw (Jan 28, 2002 10:45 am; Comment #7)
I'd go for full unicode through and through. If it causes speed problems, then fix those in python. Unicode is important and half-wimping out by going for UTF8 is just asking for trouble down the line.
ajung (Feb 9, 2002 7:22 pm; Comment #8)
Python 2.2 has a -U option that treats all strings as unicode strings. This would be a way to promote Unicode usage inside Zope.
webwurst (Feb 11, 2002 8:14 pm; Comment #9)
I vote for (S) because it makes it easier to output to a browser and to process form values.
ajung (Feb 13, 2002 9:50 am; Comment #10)
There has been a related discussion on the linux-utf8 list (http://mail.nl.linux.org/linux-utf8/2000-08/msg00025.html). They discussed whether to use UTF-8, UTF-16 or UTF-32. There was some consensus that UTF-16 is a good choice in terms of memory consumption and processing speed. UTF-8 typically requires less space for European languages, but for most East Asian text it requires 50% more storage than UTF-16 (three bytes per character instead of two).

However, this discussion only matters for us if we decide to use encoded unicode strings instead of, or in addition to, Python unicode strings.

efge (Feb 13, 2002 10:37 am; Comment #11)
UTF-16 is really a bad choice in my opinion :
  • It sits between two chairs: the "minimal" choice, UTF-8, and the unexpanded form, UCS-4 (i.e. full 4-byte Unicode),
  • It can't even code the full 4-byte Unicode range,
  • It doesn't have some of the nice properties of UTF-8 (no 0x00 byte),
  • It's not a web-browser recognized standard, so we would have conversion to do in all cases for output.

Also with UTF-16 you can not determine character properties (number of characters) any more easily than with UTF-8 (or UCS-4 for that matter), because in doing such an operation you have canonicalization (combining characters, reordering) to take into account. So the "length" argument is void with any encoding choice.

My vote would go for UTF-8 as internal representation, or python unicode strings but only if that storage is efficient enough in terms of space (which I hope it is).

Also the APIs should IMHO not accept both the internal representation and another external one like latin-1, we should stick to a single public format.

One concern I may have with UTF-8 as internal format is the fact that not all byte strings are valid UTF-8 strings, and that for security reasons we should not store invalid ones. Which begs the question, who is responsible for checking their validity ? And doing this checking several times would be wasteful.

All in all I think I'm leaning for python unicode strings (as internal representation).

kentsin (2002-02-21)
This may not be directly related: currently a valid object id cannot contain certain characters. If we choose the UTF-8 way, many combinations may contain characters that are invalid in a Zope object id. In the case of a wiki, this breaks them as links.

Also, StructuredText processing sometimes handles UTF-8 strings in surprising ways.

ajung (Feb 21, 2002 3:33 pm; Comment #12)
 > kentsin (2002-02-21) --
 >  This may not very directly related: that currently a valid object id can
 >    not contain certain characters. If we choose utf-8 way, then it may
 >    invalid many combinations which contain invalid character as valid
 >    object id for zope. In case of wiki, it make them broken as links. 
 >   
 
That's beyond the scope of that proposal.


