Another example of why strings are terrible

Posted 1 December 2016
Tagged with python, python:hypothesis, unicode

Here’s a programming assumption I used to make, that until today I’d never really thought about: changing the case of a string won’t change its length.

Now, thanks to Hypothesis, I know better:

>>> x = u'İ'
>>> len(x)
1
>>> len(x.lower())
2

I’m not going to pretend I understand enough about Unicode or Python’s string handling to say what’s going on here.

I discovered this while testing a moderately fiddly normalisation routine – the routine lowercased the string, which unexpectedly tripped a check that the result was still the right length. If you’d like to see this for yourself, here’s a minimal example:

from hypothesis import given, strategies as st

@given(st.text())
def test_changing_case_preserves_length(xs):
    assert len(xs) == len(xs.lower())

Update, 2 December 2016: David MacIver asked whether this affects Python 2, 3, or both, which I forgot to mention. The behaviour is different: Python 2 lowercases İ to a plain ASCII i, whereas Python 3 lowercases it to an ASCII i followed by a combining dot above (U+0307): i̇.
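To see exactly what the lowercased string contains, you can list its code points. This is a small sketch for Python 3 only, using the standard unicodedata module; the variable names are just for illustration.

```python
import unicodedata

x = "\u0130"  # İ, LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = x.lower()

# Print each code point in the lowercased string with its Unicode name.
for ch in lowered:
    print("U+%04X" % ord(ch), unicodedata.name(ch))
```

On Python 3 this prints two lines: an ordinary LATIN SMALL LETTER I, then a COMBINING DOT ABOVE, which is where the extra unit of length comes from.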

This means that only Python 3 has the bug where the length changes when lowercasing (whereas Python 2 commits a different sin of throwing away information).

Cory Benfield pointed out that the Unicode standard has explicit character mappings that add or remove characters when changing case, and highlighted a nice example in the other direction: when you uppercase the German Eszett (ß), you replace it with a double S (SS).
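Here is a quick way to check the ß example in Python 3. The word "straße" is just a sample I picked; the sketch also shows that lowercasing the result doesn't bring the original back.

```python
s = "straße"
upper = s.upper()

# Uppercasing expands ß into two characters, so the string grows by one.
print(upper, len(s), len(upper))

# Lowercasing again gives "ss", not "ß": the round trip is lossy.
print(upper.lower())
```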

Finally, Rob Wells wrote a follow-on post that explains this problem in more detail. He also points out the potential confusion of len(): should it count visible characters, or Unicode code points? The Swift String API does a rather good job here: if you haven’t used it, check out Apple’s introductory blog post.
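To make the len() question concrete: in Python 3, len() counts code points. The sketch below compares that with a rough count of "visible" characters that skips combining marks. The helper approx_visible_length is a hypothetical name of mine, and it is only an approximation: proper grapheme cluster segmentation, as defined by Unicode, has more rules than this.

```python
import unicodedata

def approx_visible_length(s):
    """Rough count of user-visible characters: ignore combining marks.

    Hypothetical helper for illustration only; it is not a full
    implementation of Unicode grapheme cluster segmentation.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

lowered = "\u0130".lower()  # İ lowercased: i + COMBINING DOT ABOVE in Python 3

print(len(lowered))                    # code points
print(approx_visible_length(lowered))  # approximate visible characters
```

For this string, the two counts disagree: len() sees two code points, while the approximate count treats the i and its dot as a single character.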