What is the difference between charset and collation in database?

Question:

Every time I go to create a database, I always come across these definitions and I don't really know the difference.

I understand what charset is, but not collation .

Answer:

Charset

Charset defines the set of characters you will use to store data in text form, where each character is assigned a number. A very common example is the ASCII table which is only 256 characters long. If you indicate that your system charset is ASCII, this means that your system will only recognize the 256 characters contained in it. That's why we sometimes see texts without accents on the internet, with a question mark in place of the character, as the page's charset doesn't contain the character that should be displayed. Currently, the default charset is UTF-8, which allows for up to 1,114,112 characters, allowing you to display characters in any language, including Chinese and Japanese.

Collation

Collation is the term used to define the set of rules that the server will use for ordering and comparing texts, that is, how the operators =, >, <, order by, etc. will work. For example, depending on the configured Collation, the system will sort the character 'ö' between the characters 'o' and 'p', using another collation, this character can be sorted in another position. So it can give conflict when making queries that relate tables with different collations. Besides, the collation also defines if the system will differentiate accented characters or if it will be case sensitive, for example the Latin1_General_CI_AS collation defines that the system must treat the characters as case insensitive (CI) and accent sensitive (AS). Examples:

  • latin1_general_ci: There is no distinction between uppercase and lowercase letters. Searching for “test”, records like “Test” or “TEST” will be returned.
  • latin1_general_cs: Distinguishes uppercase and lowercase letters. Searching for "test" will only return "test". Options like “Test” and “TEST” will not be returned.
  • latin1_swedish_ci: Does not distinguish between lowercase and uppercase letters and accented characters with cedilla, that is, the record containing the word “Intuition” will be returned when there is a search for the word “intuition”

Even if you change the database collation, the objects previously created will not have the collation changed, for that you will have to recreate the object.

This subject is quite complex, I tried to summarize as much as possible, I hope it helps.

Scroll to Top