language-agnostic – Ghost characters? What are they for and where are they documented or identified?

Question:

‌I know that many of you already know that there are ghost characters. Yes yes, characters that are not seen and that are nevertheless there.

It has already happened in several questions that the results do not match: two pieces of data that look identical turn out not to be.

Let's look at a very simple example. For convenience I am going to use JS / HTML, but what we will see here can happen in any high-level or low-level programming language, or when comparing data stored in a database, in a file, and so on.

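// Read the contents of both paragraphs.
var txtUno = document.getElementById("txtUno").innerHTML;
var txtDos = document.getElementById("txtDos").innerHTML;
// Compare them: "Iguales" means equal, "Distintos" means different.
var resultado = (txtUno === txtDos) ? "Iguales" : "Distintos";
console.log(resultado);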
<p id="txtUno">‌a</p>
<p id="txtDos">a</p>

Did you see? Each p element apparently contains a single letter a, yet the code says that they are not the same.

Let's take the same data and count how many characters are in each p element:

var sizeOne = document.getElementById("txtUno").innerHTML.length;
var sizeTwo = document.getElementById("txtDos").innerHTML.length;

console.log(sizeOne);
console.log(sizeTwo);
<p id="txtUno">‌a</p>
<p id="txtDos">a</p>

OMG, it says that in the first p there are two characters and in the second there is only one. Indeed, in the first p there is one of those ghost characters.

I wonder:

  • What are these characters for, other than making us rack our brains at times?
  • Are those characters listed or identified somewhere we can consult when we need to clean up data?

NOTE: I did not know what to label this question. lenguaje-agnóstico would seem to be the most appropriate. Although I have put example code using JS / HTML I have done it more for ease than anything else (to show examples with the code snippets). However, I think an answer based on each character set or something like that would do.

Answer:

Indeed, these ghost characters come from the Unicode character set, and their purpose depends on the character, the language and the way the string is interpreted. Some are control characters used for writing right-to-left languages; others are operators, accents or controls that govern how letters join.
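To see which character is hiding in a string, one option is to print its code points. The following is a minimal sketch, not a definitive tool: `describeCodePoints` is a hypothetical helper name, and U+200C (ZERO WIDTH NON-JOINER) is only an assumed stand-in for whatever invisible character is in the question's snippet.

```javascript
// Hypothetical helper: list every code point in a string so invisible ones show up.
function describeCodePoints(text) {
  // Array.from iterates by code point, so surrogate pairs stay together.
  return Array.from(text, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// Assumption: "\u200C" (ZERO WIDTH NON-JOINER) plays the role of the ghost character.
console.log(describeCodePoints("\u200Ca").join(" ")); // U+200C U+0061
console.log(describeCodePoints("a").join(" "));       // U+0061
```

Once you have the code point, you can look it up in the Unicode charts to find out what it is for.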

On its official site, Unicode has a list of the characters that are not displayed on screen and the cases in which this happens, along with details on how rendering differs depending on whether they are supported. The effect is most noticeable in browsers that lack support for this type of character, but it is not limited to them.

Unicode.org non-printable characters FAQ
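For cleaning data, one possible approach is to remove every character in the Unicode general category Cf (Format), which covers zero-width joiners, directional marks and similar characters. This is only a sketch: `stripFormatCharacters` is a hypothetical helper, and choosing Cf is my assumption about what you want to drop, not a list taken from the FAQ.

```javascript
// Hypothetical helper: drop characters in Unicode general category Cf (Format).
// Assumption: the data is made of keys or identifiers where these characters carry no meaning.
function stripFormatCharacters(text) {
  // \p{Cf} needs the "u" flag; "g" replaces every occurrence.
  return text.replace(/\p{Cf}/gu, "");
}

// Assumption: "\u200C" stands in for the ghost character from the question.
const withGhost = "\u200Ca";
const clean = "a";

console.log(withGhost === clean);                        // false
console.log(stripFormatCharacters(withGhost) === clean); // true
console.log(stripFormatCharacters(withGhost).length);    // 1
```

Be careful with natural-language text: in some scripts and in emoji sequences these characters change how the text is rendered, so removing them blindly can alter the content.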

The reason programmers get mad at them comes from the bad practice of not specifying the encoding of our text and the way it should be interpreted on screen. The following article (in English) explains in excellent detail what happens when we do not specify the encoding in our applications, which may not be a problem at the regional level until we are faced with internationalization.

Article on character tables and unicode detailing bad practices carried over from years ago
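As a small illustration of what goes wrong when the encoding is guessed instead of specified, the sketch below decodes the same bytes in two ways. The sample character and the variable names are my own, chosen only for the example.

```javascript
// Encode "é" (U+00E9) as UTF-8: the bytes are 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode("\u00E9");

// Correct: decode the bytes with the encoding they were written in.
const asUtf8 = new TextDecoder("utf-8").decode(utf8Bytes);

// Wrong guess: treat each byte as one ISO-8859-1 (Latin-1) character.
// String.fromCharCode maps byte values 0-255 straight to U+0000-U+00FF.
const asLatin1 = String.fromCharCode(...utf8Bytes);

console.log(asUtf8);          // é
console.log(asLatin1);        // Ã©
console.log(asLatin1.length); // 2
```

The bytes never changed; only the assumption about their encoding did, and that is enough to make two "equal" values compare as different.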

Regarding UTF-8

In UTF-8, the RFC 3629 standard specifies that the code points U+D800 to U+DFFF, and anything above U+10FFFF, must be treated as invalid sequences. Some implementations enforce the RFC 3629 restrictions. In other cases, what is called UTF-8 is actually one of its variants: CESU-8 (MySQL and Oracle use this implementation), MUTF-8 (Java uses this implementation) or WTF-8, which are sometimes mistakenly identified as UTF-8.

So different systems may interpret UTF-8 differently if one of them implements a UTF-8 variant that handles the code points mentioned above in its own way.
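To make the difference concrete, the sketch below encodes one character outside the Basic Multilingual Plane in standard UTF-8 and in a CESU-8 style, where each half of the UTF-16 surrogate pair gets its own three-byte sequence. `encodeCesu8`, `toHex` and `isStrictUtf8` are hypothetical helpers written for this example, not part of any library.

```javascript
// Format a byte sequence as upper-case hex.
const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

// Hypothetical sketch of CESU-8: encode each UTF-16 code unit on its own,
// so surrogate halves become three-byte sequences.
function encodeCesu8(text) {
  const bytes = [];
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i); // UTF-16 code unit, surrogates included
    if (unit < 0x80) {
      bytes.push(unit);
    } else if (unit < 0x800) {
      bytes.push(0xc0 | (unit >> 6), 0x80 | (unit & 0x3f));
    } else {
      bytes.push(0xe0 | (unit >> 12), 0x80 | ((unit >> 6) & 0x3f), 0x80 | (unit & 0x3f));
    }
  }
  return bytes;
}

// Returns true when a strict decoder accepts the bytes as UTF-8.
function isStrictUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(Uint8Array.from(bytes));
    return true;
  } catch (err) {
    return false;
  }
}

const emoji = "\u{1F600}"; // one code point, two UTF-16 code units
const standard = Array.from(new TextEncoder().encode(emoji));
const cesu = encodeCesu8(emoji);

console.log(toHex(standard));        // F0 9F 98 80
console.log(toHex(cesu));            // ED A0 BD ED B8 80
console.log(isStrictUtf8(standard)); // true
console.log(isStrictUtf8(cesu));     // false
```

The same character ends up as four bytes in one system and six in the other, and a strict RFC 3629 decoder rejects the six-byte form because it encodes code points in the U+D800 to U+DFFF range.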
