utf8mb4 and utf8 in MySQL, what is the difference?

Question:

Since the last update of phpMyAdmin, I see that the default character set is now utf8mb4.

I would like to know what the difference is between utf8mb4 and utf8, and whether there is a known reason why this variant of utf8, if we can call it that, exists.

Also, if I decide to change the character set of my tables and columns to utf8mb4, I would like to know whether I could run into problems.

Answer:

Good morning. As the documentation mentions, this "variant" of utf8 was added in MySQL 5.5.3. So what is the difference?

The UTF-8 encoding can represent every symbol in the Unicode character set, which ranges from U+000000 to U+10FFFF. That's 1,114,112 possible symbols. (Not all of these Unicode code points have assigned characters yet, but that doesn't prevent UTF-8 from being able to encode them.)
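To make the numbers concrete, here is a minimal Python sketch (not part of the original answer) that counts the code points in that range and shows how many bytes UTF-8 needs for a few sample characters. The helper name utf8_length is just an illustrative choice.

```python
# Unicode code points run from U+0000 to U+10FFFF inclusive.
TOTAL_CODE_POINTS = 0x10FFFF + 1


def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 needs to encode a single character."""
    return len(ch.encode("utf-8"))


print(TOTAL_CODE_POINTS)  # 1114112

# Sample characters that need one, two, three and four bytes respectively.
for ch in ["A", "é", "€", "𝌆"]:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")
```

Code points above U+FFFF, such as 𝌆 (U+1D306), are the ones that need four bytes.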

Many of us have used MySQL's utf8 charset for databases, tables, and columns, assuming it maps to the UTF-8 encoding described above. By using utf8, we assume we can store any symbol we want.

Example:

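-- Declare the table with MySQL's 3-byte utf8 charset explicitly,
-- so the example does not depend on the server default
CREATE TABLE ForgeRock
    (`id` int, `productName` varchar(7), `description` varchar(55))
    DEFAULT CHARSET = utf8
;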

INSERT INTO ForgeRock
    (`id`, `productName`, `description`)
VALUES
    (1, 'OpenIDM', 'Platform for building enterprise provisioning solutions'),
    (2, 'OpenAM', 'Full-featured access management'),
    (3, 'OpenDJ', 'Robust LDAP server for Java')
;

SET NAMES utf8;

Query OK, 0 rows affected (0.00 sec)

UPDATE ForgeRock SET description = 'foo𝌆bar' WHERE id = 3;

Now check the warnings:

SHOW WARNINGS;

+---------+------+------------------------------------------------------------------------------+
| Level   | Code | Message                                                                      |
+---------+------+------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xF0\x9D\x8C\x86' for column 'description' at row 1 |
+---------+------+------------------------------------------------------------------------------+
1 row in set (0.00 sec)

It turns out that MySQL's utf8 charset only partially implements proper UTF-8 encoding. It can only store UTF-8-encoded symbols that consist of one to three bytes; encoded symbols that take up four bytes are not supported.
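As a rough way to check your own data before inserting it, the following Python sketch tests whether a string stays within the three-byte limit. The function name fits_in_mysql_utf8 is hypothetical, and the check assumes the limit is exactly "no code point above U+FFFF", which is what the three-byte restriction amounts to.

```python
def fits_in_mysql_utf8(text: str) -> bool:
    """Return True if every character encodes to at most three UTF-8 bytes.

    Assumption: MySQL's utf8 charset accepts exactly the code points up to
    U+FFFF, i.e. those whose UTF-8 encoding is one to three bytes long.
    """
    return all(ord(ch) <= 0xFFFF for ch in text)


print(fits_in_mysql_utf8("Robust LDAP server for Java"))  # True
print(fits_in_mysql_utf8("foo𝌆bar"))                      # False

# The four bytes that MySQL reports in the warning message:
print("𝌆".encode("utf-8"))  # b'\xf0\x9d\x8c\x86'
```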

This affects not only the 𝌆 character but also more commonly needed symbols such as U+1F4A9 (💩). In total, 1,048,576 code points (U+10000 through U+10FFFF) cannot be stored. In fact, MySQL's utf8 can only store 5.88% ((0x00FFFF + 1) / (0x10FFFF + 1)) of all possible Unicode code points. Proper UTF-8 can encode 100% of them.
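The arithmetic can be verified with a few lines of Python. This is only a sketch of the calculation; the variable names are my own.

```python
# Code points that fit in at most three UTF-8 bytes: U+0000 through U+FFFF.
storable = 0x00FFFF + 1

# All Unicode code points: U+0000 through U+10FFFF.
total = 0x10FFFF + 1

# Code points that need four bytes and therefore do not fit in MySQL's utf8.
unsupported = total - storable

# Share of all code points that MySQL's utf8 can store, as a percentage.
share = storable / total * 100

print(storable, total, unsupported)
print(f"{share:.2f}%")
```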

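Now, if you want to change the encoding of your tables or database: utf8mb4 is a superset of utf8, so data already stored as utf8 remains valid after the conversion. Each character can take up to four bytes instead of three, though, so check indexed string columns against your storage engine's index length limits. In any case, create a backup of your data before converting anything.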
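For the conversion itself, MySQL provides ALTER DATABASE ... CHARACTER SET and ALTER TABLE ... CONVERT TO CHARACTER SET. The Python sketch below only builds those statements as text so you can review them before running them yourself. The database name my_database and the utf8mb4_unicode_ci collation are assumptions for illustration; the helper names are hypothetical.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def conversion_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build the SQL statements that switch a database and tables to utf8mb4."""
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


# "my_database" is a placeholder; ForgeRock is the table from the example.
for statement in conversion_statements("my_database", ["ForgeRock"]):
    print(statement)
```

Remember that the connection also has to use the new charset (for example SET NAMES utf8mb4), otherwise four-byte characters may still be rejected on the way in.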
