c++ – What determines the encoding of string constants (literals)

Question:

There is quite a lot of material on the net and here on SO about working with encodings and locales. But for some reason there is no intelligible information about the encoding of string constants (literals).

const char * text = "Какая ваша кодировка?";

What determines the encoding of string literals: the encoding of the source file, compiler options, or something else? What does the standard say on this topic? How can one reliably find out the encoding of string literals at compile time (are there any macros for this)? And at runtime?

Answer:

This answer covers the behavior of Microsoft Visual Studio in practice.

Unfortunately, I don’t know of a good cross-platform solution.

  • For non-Unicode source files, the string is interpreted as an ANSI-encoded string; on Russian-language systems this is CP 1251. This happens even if a different encoding is declared for the source file! Compiling on a system with a different ANSI encoding will produce a different result.
  • For source files in Unicode encodings, the string encoding is also Unicode.
    • If the string is "narrow" (that is, of type char[]), it will be converted to the ANSI encoding, with the same consequences.
    • If the string is "wide" (that is, of type wchar_t[]), it will remain as it is, that is, correct.

This means you have to use

  • either a Unicode source encoding + wide strings,
  • or narrow strings in the ANSI encoding, at the cost of compilation on non-Russian systems,
  • or literals encoded as numeric constants (see the sketch below).
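
As a sketch of the third option: spell out the bytes with hex escapes, so the stored bytes no longer depend on either the source or the execution character set. The byte values below encode the word "Как" and assume the two target encodings named in the comments.

// Hex-escaped literals: the bytes are fixed regardless of character sets.
const char * text_cp1251 = "\xCA\xE0\xEA";             // "Как" in CP 1251
const char * text_utf8   = "\xD0\x9A\xD0\xB0\xD0\xBA"; // "Как" in UTF-8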

As emerged from a long discussion with @ixSci and @Abyx, Visual Studio 2015 behaves a little differently: for a UTF-8 source file and a narrow string, the string still contains UTF-8. For a UTF-16 (UCS-2) source file, however, the result is the same as before: an attempted conversion to ANSI, which may fail.


Update:

Visual Studio 2015 and later converts strings to an internal format. The conversion is governed by the source character set, from which the characters are converted to the internal format (currently UTF-8). The source character set, which is effectively the encoding of the source file, is determined as follows:

  • If the file contains a BOM, this uniquely identifies its encoding (a small sketch of this check follows the list).
  • Otherwise, if the file looks like UTF-16, big or little endian (Visual Studio decides this from the first eight bytes), that is taken as its encoding.
  • Otherwise, if the /source-charset switch is specified at compilation (or in the project settings), the encoding named by that switch is taken as the encoding of the input file.
  • Otherwise, the system code page (i.e., ANSI) is taken as the encoding of the input file. Note that this is not the best option, since the same source bytes can be interpreted differently on different systems.
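
To make the first step concrete, here is a small sketch (not the compiler's actual code, just an illustration) of how a BOM at the start of a file maps to an encoding:

#include <cstdio>

// Report the encoding declared by a BOM, if any, in the first bytes of a
// file. (UTF-32 BOMs are not considered here.)
const char * detect_bom(const unsigned char * b, std::size_t n)
{
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
    return "no BOM";
}

int main(int argc, char ** argv)
{
    if (argc < 2) return 1;
    FILE * f = std::fopen(argv[1], "rb");
    if (!f) return 1;
    unsigned char buf[4] = {};
    std::size_t n = std::fread(buf, 1, sizeof buf, f);
    std::fclose(f);
    std::printf("%s\n", detect_bom(buf, n));
}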

The next important notion is the execution character set. This is the encoding into which narrow string/character literals (those declared without a prefix) are converted when written to the executable file, and which the program will "see" if it scans the strings byte by byte. If the /execution-charset switch is specified at compilation, it determines this character set; otherwise the current code page is used.
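
A minimal runtime check, assuming the source file itself is saved in an encoding the compiler reads correctly: dump the bytes of a narrow literal and see which execution character set was actually used.

#include <cstdio>

int main()
{
    // U+042F CYRILLIC CAPITAL LETTER YA: 0xDF in CP 1251, 0xD0 0xAF in UTF-8.
    const char text[] = "Я";
    for (const char * p = text; *p != '\0'; ++p)
        std::printf("%02X ", static_cast<unsigned char>(*p));
    std::printf("\n"); // prints "DF" under CP 1251, "D0 AF" under UTF-8
}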

Note that you can specify the /utf-8 switch, which sets both character sets to UTF-8 in one fell swoop.
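
This also gives a compile-time answer to the question asked above: sizeof counts the bytes of a literal including the terminating NUL, so a static_assert can verify which execution character set is in effect. A sketch, under the same assumption that the source file is read correctly:

// "Я" occupies two bytes in UTF-8 and one in CP 1251; sizeof adds the NUL.
static_assert(sizeof("Я") == 3, "narrow literals are not UTF-8; try /utf-8");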

Another character set, the wide execution character set, is used for converting wide character/string literals. In MS Visual Studio it is fixed and always matches UTF-16.
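
A quick check of this claim, assuming MSVC's two-byte wchar_t (on Linux wchar_t is typically four bytes, so this sketch will fail to compile there):

// Wide literals are UTF-16: one 2-byte code unit per BMP character.
static_assert(sizeof(wchar_t) == 2, "expected MSVC's 2-byte wchar_t");
static_assert(sizeof(L"Я") == 4, "one UTF-16 code unit plus the NUL");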

Documentation: Visual C++ Team Blog / New Options for Managing Character Sets in the Microsoft C/C++ Compiler.
