linux – Bash, regexp and cyrillic

Question:

We have a bash script that runs on a range of platforms, from the old Debian GNU/Linux 5.0 (Linux 2.6.32.11, bash 3.2.39) to Red Hat 4.8.2 (Linux 3.10.0, bash 4.2.46). The script takes a string, passed as a parameter or on STDIN, that can contain almost anything. The string is processed, the unwanted characters are stripped, and the result is inserted into a JSON request and sent on. I have now run into a problem that I cannot solve:

We need a regular expression that strips every character except Latin letters, Cyrillic letters, digits and punctuation marks.

That would be straightforward, except that on a number of our operating systems Cyrillic in the source code is handled badly. That is, the script works until it has to be edited. Once it is edited, a construct like this:

string=${string//[^0-9A-Za-zА-Яа-яЁё]/_};

(specifically the А-Яа-яЁё part) makes saving the open file problematic, for example in nano. The most logical solution, in my opinion, is to replace the Cyrillic characters with their codes, but how? Attempts such as \430-\44f, \u430-\u44f and \x430-\x44f were unsuccessful. Looking at the character codes with hexdump and printf gives the following picture:

printf 'abcd' | hexdump -C; exit 0;
$ ./test.sh
00000000  61 62 63 64                                       |abcd|
00000004

printf 'абвг' | hexdump -C; exit 0;
$ ./test.sh test
00000000  d0 b0 d0 b1 d0 b2 d0 b3                           |........|
00000008

printf %x "'а"; echo " "; printf %x "'я"; exit 0;
$ ./test.sh test
430
44f

printf %x "'a"; echo " "; printf %x "'z"; exit 0;
$ ./test.sh test
61
7a
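
The two views above are consistent: 430 and 44f are Unicode code points, while d0 b0 is the UTF-8 encoding of U+0430. For code points in the two-byte range U+0080..U+07FF the bytes follow from simple bit arithmetic. A minimal sketch (the function name utf8_2byte is made up for illustration):

```bash
# Hypothetical helper: print the UTF-8 bytes of a code point in the
# two-byte range U+0080..U+07FF as \xNN escapes.
utf8_2byte() {
    local cp=$(( $1 ))
    # Lead byte 110xxxxx carries the top 5 bits,
    # continuation byte 10xxxxxx carries the low 6 bits.
    printf '\\x%02x\\x%02x\n' $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
}

utf8_2byte 0x430   # U+0430 -> \xd0\xb0
utf8_2byte 0x44f   # U+044F -> \xd1\x8f
utf8_2byte 0x410   # U+0410 -> \xd0\x90
utf8_2byte 0x401   # U+0401 -> \xd0\x81
utf8_2byte 0x451   # U+0451 -> \xd1\x91
```

This is why a "range of codes" such as 430-44f cannot be written directly as one byte range: the lead byte changes from d0 to d1 partway through the alphabet.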

So, my question:

What regular expression (usable in pure bash, if possible) matches all characters except Latin, Cyrillic, digits and punctuation, given that the Cyrillic range must be written as a range of character codes rather than as the characters themselves?

Update 1

This works: echo $string | perl -lpe 's/[^0-9A-Za-z\xD0\x90-\xD0\xBF\xD1\x80-\xD1\x8F]/_/g' , but it adds a dependency on Perl. Thanks to Habr user Shetani for the tip.

Update 2

Rule:

string=${string//[^0-9A-Za-z\xd0\x90-\xd0\xbf\xd1\x80-\xd1\x8f\xd0\x81\xd1\x91]/_};

Input:

message_text='qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM абв..эюяАБВ..ЭЮЯ 1234567890~!@#$%^&*()_"`'"'";
string="<!DOCTYPE html><html><body>$message_text</body></html>";

Output:

__DOCTYPE_html__html__body_qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM__________________1234567890_________________body___html_
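
One likely reason for this result: in a bash pattern a backslash only quotes the next character, so \xd0 inside the brackets means a literal x, d and 0 rather than the byte 0xD0. Escapes of the form \xNN are decoded only inside ANSI-C quoting, $'...'. A small ASCII-only sketch of the difference (the variable names are illustrative):

```bash
s='x9-'

# Plain backslash: the class contains the characters x, 3 and 9.
plain=${s//[\x39]/_}

# ANSI-C quoting: $'\x39' is decoded to the single character 9.
ansi=${s//[$'\x39']/_}

echo "$plain"   # __-
echo "$ansi"    # x_-
```

The same quoting can in principle supply Cyrillic range endpoints, for example $'\xd0\x90', but whether a multi-byte range then matches as intended depends on the active locale and its collation rules, so it should be checked on each target system.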

Answer:

Try replacing the Cyrillic alphabet with the byte ranges \xd0\x90-\xd0\xbf and \xd1\x80-\xd1\x8f, plus \xd0\x81 for Ё and \xd1\x91 for ё:

string=${string//[^0-9A-Za-z\xd0\x90-\xd0\xbf\xd1\x80-\xd1\x8f\xd0\x81\xd1\x91]/_};
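
If the \x escapes in such a pattern are not decoded (as the output in Update 2 suggests), a pure-bash alternative is to walk the string byte by byte in the C locale and compare byte pairs written with $'...' quoting. The following is only a sketch under assumptions: the input is valid UTF-8, only letters and digits are kept as in the original rule (punctuation would have to be added to the ASCII class), the function name sanitize is invented here, and it has not been tested on every bash version mentioned in the question.

```bash
#!/usr/bin/env bash
# Hypothetical helper: keep ASCII letters and digits plus UTF-8 encoded
# U+0410..U+044F, U+0401 and U+0451; replace every other character with "_".
sanitize() (
    LC_ALL=C                 # inside the subshell, work on raw bytes
    src=$1 out='' i=0
    while (( i < ${#src} )); do
        pair=${src:i:2}      # current byte and the one after it
        case $pair in
            [0-9A-Za-z]*)                       # ASCII letter or digit
                out+=${pair:0:1}; i=$(( i + 1 )) ;;
            $'\xd0'[$'\x90'-$'\xbf'] | $'\xd1'[$'\x80'-$'\x8f'] | $'\xd0\x81' | $'\xd1\x91')
                out+=$pair; i=$(( i + 2 )) ;;   # Cyrillic letter, two bytes
            [$'\x80'-$'\xbf']*)                 # continuation byte of a rejected character
                i=$(( i + 1 )) ;;
            *)                                  # other ASCII or a foreign lead byte
                out+=_; i=$(( i + 1 )) ;;
        esac
    done
    printf '%s\n' "$out"
)

# Input: <b>ab, space, U+041F U+0440 U+0401, space, 12, U+20AC, </b>
sanitize $'<b>ab \xd0\x9f\xd1\x80\xd0\x81 12\xe2\x82\xac</b>'
```

The script itself contains only ASCII, so it can be edited on systems that mishandle Cyrillic in source files. The subshell body keeps the LC_ALL change from leaking into the caller, at the cost of a fork per call and a slow per-byte loop on long inputs.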