Question:
There is a string such as "Привет". It is checked with an if: on a match the value is 1, otherwise 0.
But if you enter "привет", "привеТ", "пРивет", and so on, the value will be 0.
How can I make the string comparison ignore letter case?
Answer:
It might seem that you can simply convert both strings to a single case (upper or lower), but it is not that simple. There is text for which text.lower() != text.upper().lower(), such as "ß":
>>> "ß".lower()
'ß'
>>> "ß".upper().lower()
'ss'
Suppose you need to compare "BUSSE" with "Buße", or even "BUSSE" with "BUẞE": in German these are all considered the same word. The recommended approach is the casefold method, which converts a string into a form suitable for case-insensitive comparison.
>>> "BUSSE".casefold() == "Buße".casefold()
True
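Applied to the check from the question, this could look like the minimal sketch below. The function name matches_greeting and its expected parameter are made up for illustration; only str.casefold and str.lower come from Python itself.

```python
def matches_greeting(user_input, expected="Привет"):
    """Return 1 if user_input equals expected ignoring case, else 0."""
    # casefold() is applied to both sides so the comparison is symmetric.
    return 1 if user_input.casefold() == expected.casefold() else 0


for candidate in ["Привет", "привет", "привеТ", "пРивет", "Пока"]:
    print(candidate, matches_greeting(candidate))

# lower() alone does not unify "ß" with "SS"; casefold() does.
lower_equal = "BUSSE".lower() == "Buße".lower()
casefold_equal = "BUSSE".casefold() == "Buße".casefold()
print(lower_equal, casefold_equal)
```

Only the last candidate, "Пока", yields 0. The final line shows why lower() is not enough for the German example.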
But that's not all. If your text renders correctly, in the following example you might think that 'Й' == 'Й'
, but it's not:
>>> 'Й' == 'Й'
False
The reason is that the first letter is a single code point (U+0419), while the second is a sequence of two code points (U+0418 followed by U+0306):
>>> import unicodedata
>>> [unicodedata.name(char) for char in 'Й']
['CYRILLIC CAPITAL LETTER SHORT I']
>>> [unicodedata.name(char) for char in 'Й']
['CYRILLIC CAPITAL LETTER I', 'COMBINING BREVE']
If such strings should be treated as equal, the easiest way is to use unicodedata.normalize. You will probably want NFKD normalization, but the documentation describes other forms; choose whichever suits your task. Then:
>>> unicodedata.normalize('NFKD', 'Й') == unicodedata.normalize('NFKD', 'Й')
True
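To help choose a form, here is a small sketch comparing the four normalization forms. The letters are written as escape sequences so the example does not depend on how the text is rendered or copied. The helper name same_after is hypothetical, and the "fi" ligature is an extra example that is not part of the original answer.

```python
import unicodedata

composed = "\u0419"          # CYRILLIC CAPITAL LETTER SHORT I
decomposed = "\u0418\u0306"  # CYRILLIC CAPITAL LETTER I + COMBINING BREVE


def same_after(form, left, right):
    """Compare two strings after applying the same normalization form."""
    return unicodedata.normalize(form, left) == unicodedata.normalize(form, right)


# Any of the four forms makes the two spellings of the letter equal.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, same_after(form, composed, decomposed))

# The "K" (compatibility) forms additionally unify characters such as ligatures.
ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
print("NFD ", same_after("NFD", ligature, "fi"))
print("NFKD", same_after("NFKD", ligature, "fi"))
```

The canonical forms (NFC, NFD) are enough for the composed and decomposed letter. The compatibility forms (NFKC, NFKD) are more aggressive and also treat the ligature as the two letters "fi", which may or may not be what your task needs.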
Putting it all together, you can use functions like this:
import unicodedata

def normalize_caseless(text):
    # Fold case first, then decompose so equivalent sequences compare equal.
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
>>> caseless_equal('BUSSE', 'Buße')
True
>>> caseless_equal('Й', 'Й')
True
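The same normalization can also serve as a lookup key when input has to be checked against several accepted words. This is a sketch only: the ACCEPTED set and the is_accepted helper are hypothetical names, and normalize_caseless is repeated so the block runs on its own.

```python
import unicodedata


def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())


# Normalize the accepted words once, then compare normalized keys.
# "\u0419од" starts with the single-code-point form of the letter.
ACCEPTED = {normalize_caseless(word) for word in ("Привет", "Buße", "\u0419од")}


def is_accepted(user_input):
    """Return True if user_input matches an accepted word, ignoring case and form."""
    return normalize_caseless(user_input) in ACCEPTED


print(is_accepted("пРивет"))
print(is_accepted("BUSSE"))
print(is_accepted("\u0418\u0306ОД"))  # decomposed spelling, upper case
print(is_accepted("пока"))
```

Normalizing once when the set is built avoids repeating the work for every comparison.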
This is a free translation of Veedrac's answer on the English Stack Overflow. The comments there are helpful and worth reading too.