python – A basic string method and a regular expression search do not work on a string

Question:

Here is some very simple code:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
from string import *

text = 'тут будет внедорожник (Туссан, туарег) или Ровер'
text = text.lower()
#text = text.decode('utf-8')
print text

model_list = ['туссан', 'туарег', 'королла', 'нексия']

for model in model_list:
    response = re.search(model, text, re.IGNORECASE)
    if response:
        print 'найдено: ' + response.group(0)

The task is simple: there is a list of models and a line of text, and for each model in the list I need to find its occurrence in the line. Here is the problem:

  1. Run exactly as shown above, the code prints this to the console:

    тут будет внедорожник (Туссан, туарег) или Ровер

    найдено: туарег

but it clearly should find two cars.

  2. If the variable is defined as text = u'тот же самый текст ...', or the line text = text.lower() is rewritten as text = text.decode('utf-8').lower(), the console shows:

    тут будет внедорожник (туссан, туарег) или ровер

i.e. the text is now lowercase, but the search finds nothing. It is also unclear why re.I (or re.IGNORECASE) has no effect at all. I tried adding re.I|re.U|re.S|re.M, and nothing changed.

What is the problem here? I just need to check a piece of text for a few words, case-insensitively.

Answer:

The problem is the encoding. There are several ways to solve it (ordered by how convenient I personally find them):

  1. Switch to Python 3

    Python 3 uses unicode literals by default, so you won't have these problems. In the code, it is enough to use the print function instead of the print statement, i.e. add the call parentheses:

     import re

     text = 'тут будет внедорожник (Туссан, туарег) или Ровер'
     model_list = ['туссан', 'туарег', 'королла', 'нексия']

     for model in model_list:
         response = re.search(model, text, re.IGNORECASE)
         if response:
             print('найдено: ' + response.group(0))

    Note also that in Python 3 we don't need to manually specify the encoding of the source file: Python 3 defaults to utf-8 encoding.

  2. Insert a line at the beginning of the file

     from __future__ import unicode_literals

    This line makes every string literal in the file a unicode string. It is especially useful for writing code that is portable between Python versions (in combination with the other features of the __future__ module). It looks like this:

     # coding: utf-8
     from __future__ import unicode_literals
     import re

     text = 'тут будет внедорожник (Туссан, туарег) или Ровер'
     model_list = ['туссан', 'туарег', 'королла', 'нексия']

     for model in model_list:
         response = re.search(model, text, re.IGNORECASE | re.UNICODE)
         if response:
             print 'найдено: ' + response.group(0)

  3. Convert the strings you use to unicode

    The regular expression module does not handle Cyrillic correctly in byte strings (it cannot search them case-insensitively). You can convert the relevant variables to unicode strings:

     # coding: utf-8
     import re

     text = 'тут будет внедорожник (Туссан, туарег) или Ровер'
     model_list = ['туссан', 'туарег', 'королла', 'нексия']

     text = text.decode('utf-8')
     for model in model_list:
         model = model.decode('utf-8')
         response = re.search(model, text, re.IGNORECASE | re.UNICODE)
         if response:
             print u'найдено: ' + response.group(0)

Note that in the output line the string u'найдено: ' is written as a unicode literal. This matters because both operands of + should be the same type, and response.group(0) returns a unicode string; adding a byte string that contains Cyrillic to it would raise UnicodeDecodeError.

  4. Use unicode literals manually:

     # coding: utf-8
     import re

     text = u'тут будет внедорожник (Туссан, туарег) или Ровер'
     model_list = [u'туссан', u'туарег', u'королла', u'нексия']

     for model in model_list:
         response = re.search(model, text, re.IGNORECASE | re.UNICODE)
         if response:
             print u'найдено: ' + response.group(0)
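
The note under method 3 about only adding strings of the same type can be reproduced in Python 3, where the split between byte strings and text strings is stricter. This is a sketch that assumes Python 3; the variable names are illustrative only. In Python 2 the same mistake shows up as a UnicodeDecodeError rather than the TypeError caught here.

```python
label_bytes = 'найдено: '.encode('utf-8')  # byte string (illustrative name)
model_text = 'туарег'                      # text string (illustrative name)

try:
    message = label_bytes + model_text     # mixing bytes and str
except TypeError as exc:
    error_name = type(exc).__name__
else:
    error_name = None

# Decoding the bytes first puts both operands in the same type.
message = label_bytes.decode('utf-8') + model_text

print(error_name)  # TypeError
print(message)     # найдено: туарег
```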

Note that once string encodings are handled properly, case-insensitive search works, but you need to add the re.UNICODE flag for the module to work correctly on unicode strings.
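
To see why the encoding matters, it may help to compare byte strings and text strings directly. The sketch below assumes Python 3, where bytes plays roughly the role of a Python 2 str literal from a UTF-8 source file; the variable names are illustrative only. It suggests why the original code found 'туарег' (already lowercase in the text) but not 'Туссан'.

```python
import re

text = 'Туссан'
needle = 'туссан'

# Byte strings: roughly what Python 2 str literals hold in a UTF-8 source file.
text_bytes = text.encode('utf-8')
needle_bytes = needle.encode('utf-8')

# bytes.lower() only maps ASCII letters, so the Cyrillic bytes stay unchanged.
bytes_lower_unchanged = text_bytes.lower() == text_bytes

# With bytes patterns, re.IGNORECASE only folds ASCII letters as well.
bytes_match = re.search(needle_bytes, text_bytes, re.IGNORECASE)

# Text strings: case conversion and case-insensitive matching are Unicode-aware.
text_lower = text.lower()
text_match = re.search(needle, text, re.IGNORECASE)

print(bytes_lower_unchanged)   # True
print(bytes_match)             # None
print(text_lower == needle)    # True
print(text_match is not None)  # True
```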

Also, prefer Python 3 for new projects unless there is a strict requirement to use Python 2.
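
For a new Python 3 project, the search from the question could be wrapped in a small helper. This is only a sketch: the function name find_models is hypothetical, and the use of re.escape is an addition that is not in the original code, meant to keep model names containing regex metacharacters from being interpreted as patterns.

```python
import re


def find_models(text, models):
    """Return the substrings of `text` that match a name in `models`, ignoring case."""
    found = []
    for model in models:
        # re.escape treats the model name as literal text, not as a pattern.
        match = re.search(re.escape(model), text, re.IGNORECASE)
        if match:
            found.append(match.group(0))
    return found


text = 'тут будет внедорожник (Туссан, туарег) или Ровер'
model_list = ['туссан', 'туарег', 'королла', 'нексия']

print(find_models(text, model_list))  # ['Туссан', 'туарег']
```

The helper returns the matched text as it appears in the input, so the original capitalization is preserved in the result.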
