python – Why are the letters of the alphabet from "р" to "ю" not included in the range "а-я"?

Question:

Let me say right away: I know that Python 2 requires explicitly declaring strings as unicode, and I understand that this code should not work correctly. I'm interested in the "anatomy" of the failure. What exactly inside re.compile() and regex.search() produces this result?


Judging by the code below, the range 'а-яё' does not include the range 'р-ю', yet the range 'а-яр-ю' includes 'ё'.

mcve.py:

# coding=utf-8

import re

# This is a pangram: it contains every letter of the alphabet
test = 'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'

regex1 = re.compile('[а-яА-ЯёЁ\s]+')
regex2 = re.compile('[а-яА-ЯёЁшьрэтфцюыхущчъ\s]+')
regex3 = re.compile('[а-яёшьрэтфцюыхущчъс\s]+')
regex4 = re.compile('[а-яр-ю\s]+')

print regex1.search(test).group()
print regex2.search(test).group()
print regex3.search(test).group()
print regex4.search(test).group()

Result:


широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства
широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства

I made sure that all letters of the alphabet from "А" to "Я" and from "а" to "я" are contiguous in Unicode, except for "Ё" and "ё", which are explicitly added to the regular expression.

By gradually adding the letters at which the search with the first expression stops, I arrived at the character class а-яА-ЯёЁшьрэтфцюыхущчъ. If you sort the added letters, you get an almost continuous interval: "ртуфхцчшщъыьэю".

If you remove the capital letters, i.e. "А-ЯЁ", the search unexpectedly stops at "с". Once "с" is added, the interval becomes continuous: from "р" to "ю". This is regex3.

And finally, it turns out that the interval can now be collapsed into a range, and "ё" can even be removed (regex4).

What is going on?

 python --version
Python 2.7.6

If you explicitly make both the string and the regular expression unicode, everything works as it should. But it also works, in some fashion, without that. How?

test2 = u'широкая электрификация южных губерний даст мощный толчок подъёму сельского хозяйства'
regex5 = re.compile(u'[а-яА-ЯёЁ\s]+')

Answer:

The easiest way to see what is wrong with the regex is to set the debug flag. In this case there is no need to dig into the internals, because they show the same thing. You can verify that by hand: open %python_folder%/Lib/sre_parse.py and add a print or two around line 438. It looks something like this:

        elif this == "[":
            # character set
            set = []
            setappend = set.append
##          if sourcematch(":"):
##              pass # handle character classes
            if sourcematch("^"):
                setappend((NEGATE, None))
            # check remaining characters
            start = set[:]
            while 1:
                this = sourceget()
                if len(this) == 1:
                    # This is exactly where the parser iterates over the contents of []
                    print(source.tell(), "ORD: ", ord(this))
                if this == "]" and set != start:
                    break

This print shows exactly the same thing as the debug flag: the letters are not actually letters.

regex1 = re.compile('[а-яА-ЯёЁ\s]+', re.DEBUG)

max_repeat 1 2147483647
  in
    literal 208
    range (176, 209)
    literal 143
    literal 208
    range (144, 208)
    literal 175
    literal 209
    literal 145
    literal 208
    literal 129
    category category_space

You can immediately see that something is off: the first entry should be a range from ord('а') to ord('я'), and instead there is some nonsense. This happens because the strings are encoded in UTF-8 (declared explicitly in the file) and, in Python 2, their type is a byte string. They display normally in a terminal if the encodings match. I use PyCharm, whose console is UTF-8. If I ran the same thing in a standard Windows terminal, it would naturally display garbage (like this: ╨Я╤А╨╕╨▓╨╡╤В), because the encoding there is CP866.
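The CP866 garbage mentioned above can be reproduced without a Windows console. The following is a minimal sketch, assuming Python 3 (or Python 2 with a u'' literal, as written); the variable names are illustrative only. It encodes the text as UTF-8 and then decodes those bytes as if they were CP866:

```python
# -*- coding: utf-8 -*-
text = u'Привет'

# Each Cyrillic letter occupies two bytes in UTF-8, so 6 letters give 12 bytes.
utf8_bytes = text.encode('utf-8')

# A CP866 console interprets every byte as a separate character.
garbled = utf8_bytes.decode('cp866')

print(len(utf8_bytes))
print(garbled)
```

Every letter turns into two unrelated characters, which is the same splitting that the regex parser performs on the pattern.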

For instance, printing the string looks fine, but iterating over it yields individual bytes:

print("Привет")
# BUT!
for char in "Привет":
    print(ord(char), repr(char))
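To relate those bytes to the debug output, it helps to list the UTF-8 byte values of the characters used in the pattern. This is a sketch rather than part of the original answer: it assumes Python 3 (or Python 2 with u'' literals), and utf8_values is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
def utf8_values(ch):
    """Return the UTF-8 byte values of one character as a list of ints."""
    return list(bytearray(ch.encode('utf-8')))

# The characters that appear in the class [а-яА-ЯёЁ]
pattern_chars = u'аяАЯёЁ'
byte_table = {ch: utf8_values(ch) for ch in pattern_chars}

for ch in pattern_chars:
    print(ch, byte_table[ch])
```

Read in order, with each hyphen joining the byte before it to the byte after it, these pairs give exactly the literals and ranges of the debug dump: 208, (176, 209), 143, 208, (144, 208), 175, 209, 145, 208, 129.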

In the debug output, 208 is the first byte of the UTF-8 encoding of 'а' and 176 is its second byte; then comes the hyphen, followed by the first byte of 'я' (209). You can verify this by opening the source in any hex editor. As a result, the range is wrong: it spans the byte values 176 to 209 rather than the letters 'а' to 'я'. Accordingly, when the parser iterates over the contents of the square brackets, it encounters not letters but bytes, or rather the halves of letters in UTF-8 encoding. You can simulate the behavior of the regular expression with the following code:

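# Simplified bounds of the byte set built from the debug output above
reg_min = 144
reg_max = 209

result = []
for char in test:  # iterating over a Python 2 byte string yields single bytes
    byte_value = ord(char)
    if reg_min <= byte_value <= reg_max:
        result.append(char)
    else:
        break
print(ord(regex1.search(test).group()))
print(list(map(ord, result)))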

>>> 209
>>> [209]
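The same idea can be extended to show which letters fall out of each character class. The sketch below is an illustration under stated assumptions, not the original author's code: it targets Python 3, emulates Python 2 byte strings by encoding the pattern and each letter to UTF-8 explicitly, and uses hypothetical names (ALPHABET, unmatched_letters).

```python
# -*- coding: utf-8 -*-
import re

ALPHABET = u'абвгдеёжзийклмнопрстуфхцчшщъыьэюя'

def unmatched_letters(pattern):
    """Compile the pattern as UTF-8 bytes and return the letters whose
    two-byte encoding is not fully matched by it."""
    byte_regex = re.compile(pattern.encode('utf-8'))
    return u''.join(
        ch for ch in ALPHABET
        if byte_regex.fullmatch(ch.encode('utf-8')) is None
    )

print(unmatched_letters(u'[а-яА-ЯёЁ\\s]+'))  # regex1
print(unmatched_letters(u'[а-яё\\s]+'))      # regex1 without the capital letters
print(unmatched_letters(u'[а-яр-ю\\s]+'))    # regex4
```

The letters "р" to "я" are encoded as 209 followed by 128 to 143, and the byte set of regex1 covers only 129 and 143 to 209. The value 129 is there only because it is the second byte of "Ё", which is why "с" (209, 129) matches until the capital letters are removed. In regex4 the bytes around the second hyphen form the range (128, 209), which covers every second byte, including the 145 of "ё".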