Question:
We have a bash script that runs on a range of platforms, from the old Debian GNU/Linux 5.0 (Linux 2.6.32.11, bash 3.2.39) to Red Hat 4.8.2 (Linux 3.10.0, bash 4.2.46). The script accepts a string (as a parameter or on STDIN) that can contain almost anything. The string is processed, the unwanted parts are cut out, and the result is inserted into a JSON request and sent on. I have run into a problem that I cannot solve at the moment:
I need a regular expression that replaces every character except Latin letters, Cyrillic letters, digits and punctuation marks.
That would be fine, except that on some of these operating systems Cyrillic in the source code is handled badly. That is, the script works until it needs to be edited. After an edit, because of a construct like:
string=${string//[^0-9A-Za-zА-Яа-яЁё]/_};
(namely, because of А-Яа-яЁё), saving the open file from an editor such as nano is problematic. The most logical solution, in my opinion, is to replace the Cyrillic characters with their codes, but how? Attempts like \430-\44f, \u430-\u44f and \x430-\x44f were unsuccessful. Looking at the hexdump output and the character codes, we get the following picture:
printf 'abcd' | hexdump -C; exit 0;
$ ./test.sh
00000000 61 62 63 64 |abcd|
00000004
printf 'абвг' | hexdump -C; exit 0;
$ ./test.sh test
00000000 d0 b0 d0 b1 d0 b2 d0 b3 |........|
00000008
printf %x "'а"; echo " "; printf %x "'я"; exit 0;
$ ./test.sh test
430
44f
printf %x "'a"; echo " "; printf %x "'z"; exit 0;
$ ./test.sh test
61
7a
So, my question:
What regular expression (usable in pure bash if possible) matches all characters except Latin letters, Cyrillic letters, digits and punctuation, given that the Cyrillic range must be written as a range of character codes rather than as the characters themselves?
Update 1
This works, but it adds a dependency on Perl (thanks to Habr user Shetani for the tip):
echo $string | perl -lpe 's/[^0-9A-Za-z\xD0\x90-\xD0\xBF\xD1\x80-\xD1\x8F]/_/g'
Update 2
Rule:
string=${string//[^0-9A-Za-z\xD0\x90-\xd0\xbf\xd1\x80-\xd1\x8f\xd0\x81\xd1\x91]/_};
Input:
message_text='qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM абв..эюяАБВ..ЭЮЯ 1234567890~!@#$%^&*()_"`'"'";
string="<!DOCTYPE html><html><body>$message_text</body></html>";
Output:
__DOCTYPE_html__html__body_qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM__________________1234567890_________________body___html_
Answer:
Try replacing the Cyrillic letters with their UTF-8 byte ranges: \xD0\x90-\xd0\xbf (А to п) and \xd1\x80-\xd1\x8f (р to я), plus Ё (\xd0\x81) and ё (\xd1\x91):
string=${string//[^0-9A-Za-z\xD0\x90-\xd0\xbf\xd1\x80-\xd1\x8f\xd0\x81\xd1\x91]/_};
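Note that bash does not expand \xHH escapes inside a ${var//pattern/} pattern (there a backslash only quotes the next character), which matches the output shown in Update 2. The escapes are expanded inside $'...' quoting, and a bracket expression of single bytes cannot describe a two-byte sequence on its own. Below is a minimal sketch of one possible workaround, not a definitive implementation: it switches to the C locale and walks the string byte by byte. It assumes the input is UTF-8; the function name sanitize and all variable names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: replace every character except ASCII letters, digits and Russian
# Cyrillic letters with "_". Assumes UTF-8 input; this file stays pure ASCII.

# The body is a subshell, so LC_ALL=C does not leak into the caller.
sanitize() (
  LC_ALL=C                      # byte semantics: ${s:i:1} is one byte
  s=$1 out='' i=0 n=${#1}
  d0=$'\xd0' d1=$'\xd1'         # UTF-8 lead bytes of the letters we keep
  after_d0=$'\x81\x90-\xbf'     # 2nd byte after 0xD0: U+0401, U+0410..U+043F
  after_d1=$'\x80-\x8f\x91'     # 2nd byte after 0xD1: U+0440..U+044F, U+0451
  cont=$'\x80-\xbf'             # any UTF-8 continuation byte
  while (( i < n )); do
    pair=${s:i:2}
    case $pair in
      [0-9A-Za-z]*)             # add ASCII punctuation here if it must survive
        out+=${pair:0:1}; i=$((i + 1)) ;;
      "$d0"[$after_d0] | "$d1"[$after_d1])
        out+=$pair; i=$((i + 2)) ;;
      *)
        out+=_; i=$((i + 1))
        # Skip the rest of a multi-byte character: one "_" per character.
        while (( i < n )) && [[ ${s:i:1} == [$cont] ]]; do
          i=$((i + 1))
        done ;;
    esac
  done
  printf '%s\n' "$out"
)

sanitize '<p>abc 123</p>'       # prints _p_abc_123__p_
```

In the script this would be called as string=$(sanitize "$string"). The loop is slower than a single parameter expansion, but it does not depend on a UTF-8 locale being installed or on how the locale collates ranges. If a UTF-8 locale is guaranteed at run time, it may be enough to build the original range endpoints with $'\xd0\x90' style quoting instead of typing the letters.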