I have just started a project and want to follow best practices from the start.
Realized that one of
utf8_unicode_ci . I lean towards
But I came across information that this is a little outdated and it is worth using
Please advise which encoding to choose for the database.
This is a free translation of the question "What's the difference between utf8_general_ci and utf8_unicode_ci?".
Both of these collations (utf8_general_ci and utf8_unicode_ci) work with UTF-8 characters; the difference is in how strings are sorted and compared.
Note: since MySQL 5.5.3, utf8mb4 should be used instead of utf8. Both are UTF-8 encodings, but the older utf8 has a MySQL-specific limitation: it stores at most three bytes per character and cannot hold characters above 0xFFFD.
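To make the note above concrete, here is a minimal sketch of setting up utf8mb4 in MySQL. The database name app_db and the table name users are hypothetical, chosen only for illustration, and the ALTER TABLE statement assumes such a table already exists.

```sql
-- Hypothetical names (app_db, users), used only for illustration.

-- Create a new database that defaults to utf8mb4 with Unicode collation.
CREATE DATABASE app_db
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Convert an existing table and its text columns to utf8mb4.
-- Assumes app_db.users already exists.
ALTER TABLE app_db.users
  CONVERT TO CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Check which character set and collation the database uses by default.
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME
FROM information_schema.SCHEMATA
WHERE SCHEMA_NAME = 'app_db';
```

Converting a large table rewrites it, so on a production system it is probably worth trying this on a copy first.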
Comparison by individual criteria:
utf8mb4_unicode_ci is based on the Unicode standard for sorting and comparing strings, so it sorts accurately across a wide range of languages and alphabets.
utf8mb4_general_ci does not implement all Unicode collation rules, which leads to undesirable results in some situations for certain languages or characters.
utf8mb4_general_ci is faster at comparison and sorting because it takes a number of performance shortcuts.
On modern servers this speed gain still exists, but it is negligible. The optimizations were designed at a time when servers had far less capacity than they do today.
utf8mb4_unicode_ci, which uses Unicode rules for sorting and comparison, employs more complex algorithms to sort correctly across a wide range of languages and with special characters. These rules take language-specific conventions into account, so characters are not always sorted in what we would call "alphabetical" order.
For the so-called "European" languages, there is not much difference between strict Unicode collation and the simplified
utf8mb4_general_ci, but there are a few differences:
For example, Unicode collation sorts "ß" like "ss" and "Œ" like "OE", as people who use those characters normally would, while
utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).
Some Unicode characters are defined as ignorable, which means they should not affect the sort order, and the comparison should move on to the next character.
utf8mb4_unicode_ci handles these characters correctly.
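The "ß" difference described above can be checked directly with a few comparisons. This is a small sketch, assuming a MySQL server that has both collations available and a client connection that sends UTF-8 (hence the SET NAMES line); the column aliases are made up for readability. The expected results in the comments follow the behaviour documented in the MySQL manual.

```sql
-- Make sure string literals are sent and interpreted as utf8mb4.
SET NAMES utf8mb4;

-- Unicode collation: "ß" compares equal to "ss".
SELECT _utf8mb4'ß' COLLATE utf8mb4_unicode_ci = _utf8mb4'ss'
       AS unicode_eszett_eq_ss;   -- expected: 1

-- Simplified collation: "ß" is not equal to "ss" ...
SELECT _utf8mb4'ß' COLLATE utf8mb4_general_ci = _utf8mb4'ss'
       AS general_eszett_eq_ss;   -- expected: 0

-- ... because it is treated as a single character equal to "s".
SELECT _utf8mb4'ß' COLLATE utf8mb4_general_ci = _utf8mb4's'
       AS general_eszett_eq_s;    -- expected: 1
```

The same pattern can be used to probe other characters that matter for your data before settling on a collation.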
For non-European languages, such as Asian languages or languages with different alphabets, there may be many more differences between Unicode collation and the simplified collation in
utf8mb4_general_ci. How well
utf8mb4_general_ci works will depend on the specific language. For some languages, it will be quite inadequate.
What should you use?
There is practically no reason to prefer
utf8mb4_general_ci for performance, because on modern processors the difference will not be a bottleneck.
There may be a measurable performance difference in some highly specialized situations; if that is your case, you should be aware of it.
Previously, some experts recommended using
utf8mb4_general_ci except where precise sorting was important enough to justify the performance cost. Today, there is more emphasis on accurate internationalization support than on minor performance losses.
I will also add that even if your application only needs to support English, it may still have to handle people's names, and names often contain characters from other languages, so using the correct sorting rules matters there too. Using Unicode rules wherever possible will help you build better applications.
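As an illustration of the point about names, here is a minimal sketch of a table that stores names with Unicode collation. The table people and the sample names are hypothetical, and the sketch assumes the client connection uses utf8mb4.

```sql
SET NAMES utf8mb4;

-- Hypothetical table, used only for illustration.
CREATE TABLE people (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name VARCHAR(100) NOT NULL
) DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

INSERT INTO people (name)
VALUES ('Zoë'), ('Strauß'), ('Strauss'), ('Søren'), ('Émile');

-- Sorting and searching use the column collation (utf8mb4_unicode_ci).
SELECT name FROM people ORDER BY name;

-- Because "ß" equals "ss" under utf8mb4_unicode_ci,
-- this lookup should return both spellings of the name.
SELECT name FROM people WHERE name = 'Strauss';

-- The collation can also be overridden for a single query,
-- for example to compare results with the simplified rules.
SELECT name FROM people ORDER BY name COLLATE utf8mb4_general_ci;
```

Overriding the collation in ORDER BY generally prevents MySQL from using an index on that column for sorting, so it is better suited to experiments than to hot queries.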