python – Why are the letters of the alphabet from "р" to "ю" not included in the range "а-я"?

Question:

To be clear: I know that Python 2 requires strings to be explicitly declared as unicode, and I understand that this is not supposed to work correctly. I'm interested in the "anatomy" of the breakdown: what exactly happens inside re.compile() and regex.search() to produce such a result?


Judging by the code below, the range 'а-яё' does not include the range 'р-ю', yet the range 'р-ю' does include 'ё'.

mcve.py:

# coding=utf-8

import re

# This is a pangram: it contains every letter of the alphabet
test = 'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'

regex1 = re.compile('[а-яА-ЯёЁ\s]+')
regex2 = re.compile('[а-яА-ЯёЁшьрэтфцюыхущчъ\s]+')
regex3 = re.compile('[а-яёшьрэтфцюыхущчъс\s]+')
regex4 = re.compile('[а-яр-ю\s]+')

print regex1.search(test).group()
print regex2.search(test).group()
print regex3.search(test).group()
print regex4.search(test).group()

Result:


широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства

I made sure that all the letters of the alphabet from "А" to "Я" and from "а" to "я" are consecutive in Unicode, except for "Ёё", which are explicitly added to the regular expression.
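That check can be sketched as follows. This is a minimal illustration written in Python 3 syntax (an assumption; the question itself uses Python 2.7), where ord and chr operate on code points. The variable names are illustrative.

```python
# Assumption: Python 3, where str is a sequence of Unicode code points.
lower = [chr(c) for c in range(ord("а"), ord("я") + 1)]
upper = [chr(c) for c in range(ord("А"), ord("Я") + 1)]

print("".join(lower))               # the 32 consecutive lowercase letters
print(len(lower), len(upper))
print("ё" in lower, "Ё" in upper)   # both letters sit outside the ranges
print(hex(ord("ё")), hex(ord("Ё")))
```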

By gradually adding the letters at which the search with the first expression stops, I arrived at the range а-яА-ЯёЁшьрэтфцюыхущчъ. If you sort the added letters, you get an almost continuous interval: "ртуфхцчшщъыьэю".

If you remove the capital letters, i.e. "[А-ЯЁ]", the search unexpectedly stops at "с". Adding it makes the interval continuous: from "р" to "ю". This is regex3.

And finally, it turns out that the interval can now be collapsed into a range, and even the "ё" can be removed (regex4).

What is going on?

 python --version
Python 2.7.6

If you explicitly make both the string and the regular expression Unicode, everything works as it should. Yet it somehow works without that too. How exactly?

test2 = u'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'
regex5 = re.compile(u'[а-яА-ЯёЁ\s]+')
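For comparison, here is a minimal sketch of the Unicode case in Python 3 syntax (an assumption; in Python 3 every str is already Unicode, so no u prefix is needed). The names test_text and regex_text are illustrative.

```python
import re

# Assumption: Python 3, where the pattern and the text are both code-point strings.
test_text = "широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства"
regex_text = re.compile(r"[а-яА-ЯёЁ\s]+")

matched = regex_text.search(test_text).group()
print(matched == test_text)  # the class now covers whole letters, not bytes
```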

Answer:

The easiest way to see what is wrong with the regex is to set the debug flag. In this case there is no need to dig into the internals at all: they will show the same thing. You can verify this by opening %python_folder%/Lib/sre_parse.py and adding a couple of print calls around line 438, which looks something like this:

elif this == "[":
            # character set
            set = []
            setappend = set.append
##          if sourcematch(":"):
##              pass # handle character classes
            if sourcematch("^"):
                setappend((NEGATE, None))
            # check remaining characters
            start = set[:]
            while 1:
                this = sourceget()
                if len(this) == 1:
                    # This is exactly where the parser iterates over the contents of []
                    print(source.tell(), "ORD: ", ord(this))
                if this == "]" and set != start:
                    break

This print shows exactly the same thing as the debug output: the letters are not really letters.

regex1 = re.compile('[а-яА-ЯёЁ\s]+', re.DEBUG)

max_repeat 1 2147483647
  in
    literal 208
    range (176, 209)
    literal 143
    literal 208
    range (144, 208)
    literal 175
    literal 209
    literal 145
    literal 208
    literal 129
    category category_space

It is immediately clear that something is off: the first item should be the range from ord('а') to ord('я'), and instead there is some kind of nonsense. The reason is that the strings are encoded in UTF-8 (the encoding is declared explicitly in the file) and their type is a byte string. They display normally in a terminal only if the encodings match. I use PyCharm, whose console is UTF-8. If I ran the same thing in a standard Windows terminal, it would naturally display gibberish (like this: ╨Я╤А╨╕╨▓╨╡╤В), because the encoding there is CP866.

For instance,

print("Привет")
# BUT!
for char in "Привет":
    print(ord(char), repr(char))
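The same bytes can be inspected in Python 3, where text and bytes are separate types. This is a minimal sketch, assuming the text is encoded as UTF-8 as in the file above; the variable names are illustrative.

```python
# Assumption: Python 3; .encode() yields the bytes a Python 2 str literal held.
word = "Привет"
raw = word.encode("utf-8")

print(len(word), len(raw))        # 6 letters, 12 bytes
print(list(raw))                  # two byte values per Cyrillic letter
mojibake = raw.decode("cp866")    # what a CP866 console would render
print(mojibake)
```

Every letter occupies two bytes, and a CP866 console shows each byte as a separate character.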

In the debug output, the number 208 is the first half of the UTF-8 character 'а' and 176 is the second half; then comes the hyphen, followed by 209, the first half of the character 'я' (you can verify this by opening the source in any hex editor). As a result, the range is wrong: it runs from byte 176 to byte 209. So when the parser iterates over the contents of the square brackets, it encounters not letters but bytes, or rather halves of letters in UTF-8 encoding. You can simulate the behavior of the regular expression with the following code:

reg_min = 144
reg_max = 209

result = []
for char in test:
    code = ord(char)
    if reg_min <= code <= reg_max:
        result.append(char)
    else:
        break
print(ord(regex1.search(test).group()))
print(list(map(ord, result)))

>>> 209
>>> [209]
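A fuller reproduction is possible in Python 3 by compiling the UTF-8 bytes of the pattern, which is roughly what Python 2 did with a plain str. The sketch below is an illustration under that assumption, not the actual sre_parse logic: tokenize_class and byte_regex are hypothetical helpers, and the tokenizer handles only plain bytes and hyphen ranges, not escapes such as \s.

```python
import re

def tokenize_class(body):
    """Hypothetical helper: split a class body into byte literals and ranges."""
    data = body.encode("utf-8")
    items, i = [], 0
    while i < len(data):
        # A hyphen followed by one more byte turns "lo-hi" into a range.
        if i + 2 < len(data) and data[i + 1] == ord("-"):
            items.append(("range", (data[i], data[i + 2])))
            i += 3
        else:
            items.append(("literal", data[i]))
            i += 1
    return items

def byte_regex(pattern):
    # Assumption: compiling UTF-8 bytes mimics a Python 2 str pattern.
    return re.compile(pattern.encode("utf-8"))

test = "широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства"
test_bytes = test.encode("utf-8")

items1 = tokenize_class("а-яА-ЯёЁ")   # class body of regex1
items4 = tokenize_class("а-яр-ю")     # class body of regex4
print(items1)
print(items4)

first = byte_regex(r"[а-яА-ЯёЁ\s]+").search(test_bytes).group()
whole = byte_regex(r"[а-яр-ю\s]+").search(test_bytes).group()
print(first)                # a lone lead byte: half of the letter "ш"
print(whole == test_bytes)  # regex4 consumes the entire pangram
```

Under this model, regex1 stops after one byte because the second byte of "ш" (136) is not among 129, 143 or 144..209. In regex4, 'р-ю' becomes a range from the second byte of "р" (128) to the first byte of "ю" (209), which happens to cover both bytes of every lowercase letter, including "ё".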