We have a bash script that runs on a range of platforms, from the old Debian GNU/Linux 5.0 (Linux 188.8.131.52, bash 3.2.39) to Red Hat 4.8.2 (Linux 3.10.0, bash 4.2.46). The script takes a string as input (as a parameter or on STDIN), and that string can contain all sorts of things. The string is processed, the excess is stripped out, and the result is inserted into a JSON request and sent on. But I have run into a problem that I cannot solve at the moment. It is as follows:
We need a regular expression that strips all characters except Latin letters, Cyrillic letters, digits and punctuation marks.
And everything would be fine, but on some of these operating systems Cyrillic in the source code is met with hostility. That is, the script works until it needs to be edited or corrected. After an attempt to edit it, the regular expression (specifically its А-Яа-яЁё part) makes saving the open file in, say, nano problematic. The most logical solution, in my opinion, is to replace the Cyrillic characters themselves with their codes, but how? Attempts like \x430-\x44f are unsuccessful. Looking at the codes with hexdump gives the following picture:
printf 'abcd' | hexdump -C; exit 0;
$ ./test.sh
00000000  61 62 63 64  |abcd|
00000004

printf 'абвг' | hexdump -C; exit 0;
$ ./test.sh test
00000000  d0 b0 d0 b1 d0 b2 d0 b3  |........|
00000008

printf %x "'а"; echo " "; printf %x "'я"; exit 0;
$ ./test.sh test
430 44f

printf %x "'a"; echo " "; printf %x "'z"; exit 0;
$ ./test.sh test
61 7a
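A likely reason the \x430-\x44f attempt fails: in bash, \x inside $'...' takes at most two hex digits, i.e. a single byte, while 430 and 44f are code points. The hexdump shows that each Cyrillic letter is stored as two UTF-8 bytes, so one option is to spell those bytes out. A minimal sketch, assuming the data is UTF-8 (the variable names are made up for illustration):

```bash
#!/usr/bin/env bash
# Assumption: the input is UTF-8. Each letter is written as its two UTF-8
# bytes, so this source file contains only ASCII characters.
cyr_a=$'\xd0\xb0'      # U+0430, Cyrillic small letter a (bytes d0 b0)
cyr_ya=$'\xd1\x8f'     # U+044F, Cyrillic small letter ya (bytes d1 8f)
cyr_yo=$'\xd1\x91'     # U+0451, small letter yo, which lies outside U+0430..U+044F

# Prints the three letters on a UTF-8 terminal.
printf '%s %s %s\n' "$cyr_a" "$cyr_ya" "$cyr_yo"
```

The $'...' quoting with \x escapes is also available in bash 3.2, whereas the \u escape only appeared in bash 4.2, so byte escapes seem the safer choice for the older systems.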
Finally, my question:
What should the regular expression look like (ideally usable in a pure bash environment) that matches all characters except Latin letters, Cyrillic letters, digits and punctuation, given that the range of Cyrillic characters must be written as a range of character codes rather than as the characters themselves?
Update #1
This works:
echo "$string" | perl -lpe 's/[^0-9A-Za-z\xD0\x90-\xD0\xBF\xD1\x80-\xD1\x8F]/_/g'
but it introduces a dependency on Perl. Thanks to Habr user Shetani for this tip.
message_text='qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM абв..эюяАБВ..ЭЮЯ 1234567890~!@#$%^&*()_"`'"'"; string="<!DOCTYPE html><html><body>$message_text</body></html>";
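If the Perl dependency is a problem, one possible pure-bash alternative is to walk the string byte by byte under LC_ALL=C and keep only ASCII letters, digits, punctuation, spaces, and the two-byte UTF-8 sequences for А-я, Ё and ё. This is only a sketch under the assumption that the input is UTF-8. The function name sanitize is hypothetical, and it has not been checked on bash 3.2.39 specifically, although it avoids features newer than 3.1.

```bash
#!/usr/bin/env bash
# sanitize (hypothetical name): replace every character that is not an ASCII
# letter, digit, punctuation mark, space or Cyrillic letter with "_".
# Assumption: the input is UTF-8. The source itself stays pure ASCII.
sanitize() (
    LC_ALL=C                     # index by bytes; the subshell keeps this local
    s=$1 out='' i=0 n=${#1}
    while (( i < n )); do
        case ${s:i:2} in
            # d0 90..bf = U+0410..U+043F, d1 80..8f = U+0440..U+044F,
            # d0 81 = U+0401 (capital yo), d1 91 = U+0451 (small yo)
            $'\xd0'[$'\x90'-$'\xbf'] | $'\xd1'[$'\x80'-$'\x8f'] | $'\xd0\x81' | $'\xd1\x91')
                out+=${s:i:2}; i=$((i + 2)); continue ;;
        esac
        case ${s:i:1} in
            [[:alnum:][:punct:]] | ' ')
                out+=${s:i:1} ;;
            *)
                out+=_
                # Skip UTF-8 continuation bytes (80..bf) so that one
                # multi-byte character becomes a single "_".
                while [[ ${s:i+1:1} == [$'\x80'-$'\xbf'] ]]; do
                    i=$((i + 1))
                done ;;
        esac
        i=$((i + 1))
    done
    printf '%s\n' "$out"
)
```

It could then be called as sanitize "$message_text". Unlike a single bracket expression over bytes, the loop checks the lead byte and the following byte together, so a stray d0 or d1 byte on its own is not accepted. The price is speed: for long inputs a per-byte loop in bash will be noticeably slower than the Perl one-liner.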
Try replacing the Cyrillic alphabet with the ranges