mysql – What are the differences between utf8 and utf8mb4?

Question:

When importing my mysql database to a windows server after having created it on a local server (xampp), I couldn't import the script that I exported from the database on the server. So I decided to copy the scripts from table by table, and I found that only part of the script gave an error:

ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

Removing all these occurrences of the exported script, I managed to upload the database to the windows server. However, some problems are occurring, such as some pages on the website that have accents changed to symbols and other problems that I don't know if they are due to the absence of the aforementioned line.

I wanted to understand the difference(s) between utf8 and utf8mb4, to see if this could be causing the website problems.

Answer:

Previously, programming languages ​​supported only the ASCII encoding that defines 128 symbols. This encoding is excellent for English, producing very compact texts where each letter takes only one byte. With the growth of the internet and an increasingly globalized world, problems quickly began to arise, such as people in Brazil not being able to use accents in their words. That's when initiatives to create a codification that would bring together all the symbols used all over the world began.

ASCII only defines 128 symbols, which makes the first bit of every byte zero in this encoding. The UTF-8 standard took advantage of this and defined the first 128 symbols to be exactly the same as ASCII. When a character that is not present in this pattern is required, UTF-8 sets the value of the first bit to 1 and sets codes that say whether the character will be 1, 2, 3, or 4 bytes long. So a program that uses UTF-8 will have full compatibility with any ASCII text.

The problem is that MySQL has not fully adhered to the UTF-8 standard. He implemented only symbols up to 3 bytes and forgot about the rest. What is declared in MySQL as utf8 isn't actually UTF-8, it's just a piece of it. To fix this error, starting with version 5.5, MySQL implemented the complete pattern going from 1 to 4 bytes and as it had already used the name utf8 it called its new implementation utf8mb4. In summary MySQL's utf8 is not UTF-8 and utf8mb4 fully follows the UTF-8 standard.

Still, utf8 and utf8mb4 have a great compatibility, the absolute majority of the characters will be the same on both systems. If you switch from one to the other you probably won't see a difference. Unless of course Chinese people start using animals as letters, then they will be upset when #û&ý appears in place of the kittens. Even if you use all the existing accents it wouldn't be any problem!

The point is that the MySQL standard is Latin1 encoding, also known as ISO 8859-1 which defines all the characters of the Latin language and can be used very well in Portuguese. When you stopped declaring UTF-8mb4, MySQL used that encoding and as your application is probably in UTF-8 these patterns don't represent the accents in the same way, but represent ASCII in the same way, so the error only appears in the accents.

Maybe this part of the script went wrong because the MySQL version used doesn't support utf8mb4. If that's the case just use utf8 instead and the accents will be compatible.

Scroll to Top