Question:
I am starting a project and want to follow best practices from the outset.
I understand that I need to pick either utf8_general_ci or utf8_unicode_ci, and I lean towards utf8_unicode_ci.
However, I came across information that these are somewhat outdated and that utf8mb4_general_ci or utf8mb4_unicode_ci should be used instead.
Please advise which encoding to choose for the database.
Answer:
This is a loose translation of the question "What's the difference between utf8_general_ci and utf8_unicode_ci?".
Both of these collations (utf8_general_ci and utf8_unicode_ci) work with UTF-8 encoded characters; the difference lies in how strings are sorted and compared.
Note: since MySQL 5.5.3, utf8mb4 should be preferred over utf8. Both are UTF-8 encodings, but the older utf8 has a MySQL-specific limitation: it stores at most three bytes per character, so characters above 0xFFFD (emoji, for example) cannot be used.
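To make the limitation concrete, here is a minimal sketch. The table names demo_utf8 and demo_utf8mb4 are hypothetical, and the sketch assumes a MySQL 5.5.3 or later server running in strict SQL mode.

```sql
-- Assumption: the connection itself must use utf8mb4, otherwise
-- 4-byte characters are lost before they ever reach the table.
SET NAMES utf8mb4;

-- Hypothetical demo tables, one per character set.
CREATE TABLE demo_utf8    (txt VARCHAR(10)) CHARACTER SET utf8;
CREATE TABLE demo_utf8mb4 (txt VARCHAR(10)) CHARACTER SET utf8mb4;

-- U+1F600 needs four bytes in UTF-8, so utf8mb4 can store it.
INSERT INTO demo_utf8mb4 (txt) VALUES ('😀');

-- This insert should be rejected with an "Incorrect string value" error
-- in strict mode (in non-strict mode the value is truncated with a warning).
INSERT INTO demo_utf8 (txt) VALUES ('😀');

-- Clean up the demo tables.
DROP TABLE demo_utf8, demo_utf8mb4;
```

Running the two inserts one after the other in an interactive client shows which character set accepts the value.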
A comparison by individual criteria:
Accuracy
- utf8mb4_unicode_ci is based on the Unicode standard for sorting and comparing strings, so it sorts accurately across a wide range of languages and alphabets.
- utf8mb4_general_ci does not implement all of the Unicode collation rules, which can lead to undesirable results in some situations for certain languages or characters.
Performance
- utf8mb4_general_ci is faster at comparison and sorting because it takes a number of performance-related shortcuts. On modern servers this speed-up is always present but insignificant. The optimizations were devised at a time when servers were far less powerful than they are today.
- utf8mb4_unicode_ci, which applies the Unicode rules for sorting and comparison, uses a more sophisticated algorithm in order to sort correctly across a wide range of languages and special characters. These rules take language-specific conventions into account, since not every language sorts its characters in what we would call "alphabetical" order.
For the so-called "European" languages there is not much difference between strict Unicode collation and the simplified utf8mb4_general_ci, but there are a few differences:
- The Unicode collation sorts "ß" like "ss" and "Œ" like "OE", just as people who use those characters would expect, while utf8mb4_general_ci sorts them as single characters (presumably like "s" and "e" respectively).
- Some Unicode characters are defined as ignorable, which means that they should not affect the sort order and the comparison should simply move on to the next character. utf8mb4_unicode_ci handles these characters correctly.
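The "ß" case can be checked directly on a server. The query below is only a sketch and the column aliases are arbitrary; the MySQL manual documents "ß" = "ss" for utf8mb4_unicode_ci and "ß" = "s" for utf8mb4_general_ci, so the three columns should come back as 1, 0 and 1.

```sql
-- Make sure the string literals are sent and interpreted as utf8mb4.
SET NAMES utf8mb4;

SELECT
  -- Unicode collation expands the sharp s into two characters.
  'ß' = 'ss' COLLATE utf8mb4_unicode_ci AS unicode_ss,
  -- The simplified collation compares character by character,
  -- so a one-character string cannot equal a two-character one.
  'ß' = 'ss' COLLATE utf8mb4_general_ci AS general_ss,
  -- Instead it treats the sharp s like a plain "s".
  'ß' = 's'  COLLATE utf8mb4_general_ci AS general_s;
```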
For non-European languages, such as Asian languages or languages with different alphabets, there are many more differences between Unicode collation and the simplified collation of utf8mb4_general_ci. How suitable utf8mb4_general_ci is will depend on the specific language. For some languages it may be grossly inadequate.
What should you use?
There is practically no reason to prefer utf8mb4_general_ci for performance any more, because on modern processors the difference will not be a bottleneck.
There may still be a measurable performance difference in some highly specialized situations; if that is your case, you should be aware of it.
Previously, some experts recommended using utf8mb4_general_ci except where precise sorting was necessary and mattered more than the performance cost. Today, accurate support for internationalization is valued more than a minor performance loss.
I will also add that even if your application only has to support English, users may still enter people's names, and names often contain characters found in other languages, so it is still important to use correct sorting rules. Using Unicode rules wherever possible will help you build better applications.
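As a starting point for a new project, the recommendation could be applied roughly as follows. This is a sketch, not a definitive setup: the database name myapp and the table users are hypothetical, and it assumes MySQL 5.5.3 or later.

```sql
-- Hypothetical database name; the character set and collation
-- become the defaults for every table created inside it.
CREATE DATABASE myapp
  CHARACTER SET utf8mb4
  COLLATE utf8mb4_unicode_ci;

-- Hypothetical table; it inherits utf8mb4_unicode_ci from the database.
CREATE TABLE myapp.users (
  id   INT PRIMARY KEY,
  name VARCHAR(100)
);

-- Check which collation each table in the database actually uses.
SELECT TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'myapp';

-- A table created earlier with another character set can be converted
-- with a statement of this form (here it is a no-op, shown for shape only).
ALTER TABLE myapp.users
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```

The client connection should also use utf8mb4 (for example via SET NAMES utf8mb4), otherwise 4-byte characters can be lost before they reach the server.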