Another example of why strings are terrible
Here’s a programming assumption I used to make, that until today I’d never really thought about: changing the case of a string won’t change its length.
Now, thanks to Hypothesis, I know better:
>>> x = u'İ'
>>> len(x)
1
>>> len(x.lower())
2
I’m not going to pretend I understand enough about Unicode or Python’s string handling to say what’s going on here.
I discovered this while testing a moderately fiddly normalisation routine – this routine would normalise the string to lowercase, unexpectedly tripping a check that it was the right length. If you’d like to see this for yourself, here’s a minimal example:
from hypothesis import given, strategies as st


@given(st.text())
def test_changing_case_preserves_length(xs):
    # Fails on Python 3 for inputs such as 'İ' (U+0130)
    assert len(xs) == len(xs.lower())
Update, 2 December 2016: David MacIver asked whether this affects Python 2, 3, or both, which I forgot to mention. The behaviour is different: Python 2 lowercases İ to an ASCII i, whereas Python 3 produces an i followed by a combining dot above: i̇. This means that only Python 3 has the bug where the length changes under lowercasing (whereas Python 2 commits a different sin of throwing away information).
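To see exactly what comes back from lower(), you can list the code points on either side of the call. This is a small sketch using the standard library's unicodedata module; the helper name describe is my own invention, and the output shown assumes Python 3.

```python
import unicodedata


def describe(s):
    """Return (code point, Unicode name) pairs for each character in s."""
    return [("U+%04X" % ord(c), unicodedata.name(c)) for c in s]


x = "\u0130"  # İ
print(describe(x))
# [('U+0130', 'LATIN CAPITAL LETTER I WITH DOT ABOVE')]
print(describe(x.lower()))
# [('U+0069', 'LATIN SMALL LETTER I'), ('U+0307', 'COMBINING DOT ABOVE')]
```

So the lowercase form keeps the dot, but as a separate combining character, which is where the extra unit of length comes from.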
Cory Benfield pointed out that the Unicode standard has explicit character mappings that add or remove characters when changing case, and highlighted a nice example in the other direction: when you uppercase the German Eszett (ß), you replace it with a double S.
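Here is a quick way to try the Eszett case yourself. It is only a sketch, it assumes Python 3, and the sample word is my own choice rather than anything from Cory's example.

```python
s = "Straße"
upper = s.upper()

print(upper, len(s), len(upper))  # STRASSE 6 7

# The round trip is lossy: lowercasing again does not bring the ß back.
print(upper.lower())  # strasse
```

So upper() followed by lower() is not guaranteed to return the string you started with, let alone one of the same length.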
Finally, Rob Wells wrote a follow-on post that explains this problem in more detail. He also points out the potential confusion of len(): should it count visible characters, or Unicode code points? The Swift String API does a rather good job here: if you haven’t used it, check out Apple’s introductory blog post.
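Python's len() counts code points. To get a feel for the difference, here is a rough sketch that skips combining marks when counting. The function name approx_visible_length is hypothetical, and this is only an approximation: proper grapheme cluster segmentation (as described in Unicode Standard Annex #29) covers many more cases, and the standard library doesn't provide it.

```python
import unicodedata


def approx_visible_length(s):
    """Count code points that are not combining marks.

    A rough approximation only: real grapheme cluster segmentation
    handles many more cases, such as emoji sequences.
    """
    return sum(1 for c in s if unicodedata.combining(c) == 0)


lowered = "\u0130".lower()  # 'i' + COMBINING DOT ABOVE on Python 3

print(len(lowered))                    # 2
print(approx_visible_length(lowered))  # 1
```

Under this approximation the lowercased İ counts as one visible character, even though len() reports two code points.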