c++ – wofstream stops writing when it reaches Cyrillic text

Question:

I wrote an application in Visual C++ 2010 Express. It turned out that Cyrillic text also has to be written to the output file, and the file must be UTF-8 encoded. I replaced ofstream with wofstream. Before, the text simply came out in the wrong encoding; now, when I try to write a string containing Cyrillic, the output stops at the first Cyrillic character (it is an XML file). Please advise how to work with Cyrillic correctly. Here are the relevant pieces of code.

...  
wofstream xml;  
...  
xml.open("output.xml");  
...  
xml << "path=\""<< fname <<"\" "; // <- the stream breaks off here... e.g. a string like "D:\backup\section один" is written to the file as D:\backup\section, and nothing after that reaches the file.  
...

I tried the following simplified version, with the same result: only the Latin characters are written.

#include "stdafx.h"
#include <fstream>
#include <string>
using namespace std;
int _tmain(int argc, _TCHAR* argv[])
{
    wofstream xml;
    wstring s0 = L"example cyrilic and latin text Кириллица и латинский текст";
    xml.open("output.txt");
    xml << L"example cyrilic and latin text Кириллица и латинский текст" << endl;
    //xml << s0 << endl; <- same result as above... only the Latin part is written.
    return 0;
}

I tried adding setlocale(LC_ALL, "ru_RU.UTF-8"); with the same result.

Yes, that shows up in the simplified test … but in the full version (the listing is large and split across several files) the problem remains … I worked around it by going back to ofstream and converting the strings with cp2utf before writing them.

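#include <cstring>  // strlen

// Converts a NUL-terminated cp1251 string to UTF-8.
// res must have room for up to 3 * strlen(str) + 1 bytes.
void cp2utf( char* str, char* res ) {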
static const long utf[ 256 ] = {
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,
59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,
87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,
111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,1026,1027,8218,
1107,8222,8230,8224,8225,8364,8240,1033,8249,1034,1036,1035,1039,1106,8216,8217,
8220,8221,8226,8211,8212,8250,8482,1113,8250,1114,1116,1115,1119,160,1038,1118,1032,
164,1168,166,167,1025,169,1028,171,172,173,174,1031,176,177,1030,1110,1169,181,182,
183,1105,8470,1108,187,1112,1029,1109,1111,1040,1041,1042,1043,1044,1045,1046,1047,
1048,1049,1050,1051,1052,1053,1054,1055,1056,1057,1058,1059,1060,1061,1062,1063,
1064,1065,1066,1067,1068,1069,1070,1071,1072,1073,1074,1075,1076,1077,1078,1079,
1080,1081,1082,1083,1084,1085,1086,1087,1088,1089,1090,1091,1092,1093,1094,1095,
    1096,1097,1098,1099,1100,1101,1102,1103
};
    int cnt = strlen( str ),
        i = 0, j = 0;
    for(; i < cnt; ++i ) {
        // Look up the Unicode code point for this cp1251 byte.
        long c = utf[ (unsigned char) str[ i ] ];
        if( c < 0x80 ) {
            // ASCII: one byte.
            res[ j++ ] = c;
        }
        else if( c < 0x800 ) {
            // Two-byte sequence: 110xxxxx 10xxxxxx.
            res[ j++ ] = (c >> 6) | 0xc0;
            res[ j++ ] = (c & 0x3f) | 0x80;
        }
        else if( c < 0x10000 ) {
            // Three-byte sequence: 1110xxxx 10xxxxxx 10xxxxxx.
            res[ j++ ] = (c >> 12) | 0xe0;
            res[ j++ ] = ((c >> 6) & 0x3f) | 0x80;
            res[ j++ ] = (c & 0x3f) | 0x80;
        }
    }
    res[ j ] = '\0';
}

Thank you all for your help. Unfortunately, the problem with wofstream remained unsolved. I got the functionality I needed in the way described above.

Answer:

@Sphinx, I have run out of comments, so I am replying in an answer.

You certainly did the right thing. In practice I would do the same (only in plain C, but that's beside the point).

It's just odd that it didn't work out in C++. Admittedly, my example is from Linux (I had to install the ru_RU.cp1251 locale; my default is en_US.UTF-8). But in principle Windows should have a UTF locale too, for example the same "en_US.utf-8", so an example might come in handy.

#include <iostream>
#include <fstream>
#include <locale>

using namespace std;

int main (int ac, char *av[])
{
  cout << "Hi\n";
  wofstream os;
  os.open("test.ws");
  os.imbue(locale(av[1] && *av[1] == 'c' ? "ru_RU.cp1251" : "en_US.utf-8"));
  os << "1 aa xaxa xoxo wofstream\n" << L"zz рус бук\n";
  os.close();

  cout << "End\n";
  return 0;
}

And here is the result:

avp@avp-xub11:~/hashcode$ g++ wofstream.cpp
avp@avp-xub11:~/hashcode$ ./a.out 
Hi
End
avp@avp-xub11:~/hashcode$ cat test.ws 
1 aa xaxa xoxo wofstream
zz рус бук
avp@avp-xub11:~/hashcode$ ./a.out cp1251
Hi
End
avp@avp-xub11:~/hashcode$ cat test.ws 
1 aa xaxa xoxo wofstream
zz ��� ��
avp@avp-xub11:~/hashcode$ iconv -f cp1251 -t utf-8 test.ws 
1 aa xaxa xoxo wofstream
zz рус бук
avp@avp-xub11:~/hashcode$
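
The run above relies on named locales being installed on the system. If they are not, one possible alternative is to imbue the stream with a locale that carries only a UTF-8 conversion facet. Below is a minimal sketch of that idea, not something tested on Visual C++ 2010: it assumes the toolchain ships <codecvt> (standard since C++11, deprecated since C++17), and write_utf8 is a hypothetical helper name. Where wchar_t is 16 bits wide, std::codecvt_utf8<wchar_t> covers only the Basic Multilingual Plane, which is enough for Cyrillic.

```cpp
#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Hypothetical helper: writes `text` to the file `path` as UTF-8.
// Returns false if the file cannot be opened or the write fails.
bool write_utf8(const char* path, const std::wstring& text)
{
    std::wofstream out;
    // Keep the classic "C" locale, but replace its codecvt facet with a
    // wchar_t -> UTF-8 converter. The locale takes ownership of the facet.
    out.imbue(std::locale(std::locale::classic(),
                          new std::codecvt_utf8<wchar_t>));
    // Imbue before opening; binary mode avoids newline translation.
    out.open(path, std::ios::out | std::ios::binary);
    if (!out)
        return false;
    out << text;
    out.flush();
    return out.good();
}
```

A call could look like write_utf8("output.xml", L"zz рус бук\n"), provided the compiler reads the source file in the encoding it is actually saved in.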

Note that in the program source (wofstream.cpp) the Russian letters are stored as UTF-8. If you remove the L prefix from the string literal, no transcoding takes place (even if a locale is set).

Likewise, nothing is re-encoded if you replace wofstream with ofstream (in which case the L prefix has to be removed, of course).
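
Since ofstream passes bytes through unchanged, another option is to do the conversion by hand and hand the stream bytes that are already UTF-8, much like cp2utf in the question but starting from wide characters. The following is only a sketch under assumptions: to_utf8 is a hypothetical helper, each wchar_t is treated as one complete code point (so UTF-16 surrogate pairs on Windows are not combined), and invalid values are not rejected.

```cpp
#include <string>

// Hypothetical helper: encodes each wide character as UTF-8 bytes.
// Assumes every wchar_t holds one complete Unicode code point.
std::string to_utf8(const std::wstring& ws)
{
    std::string out;
    for (std::size_t i = 0; i < ws.size(); ++i) {
        unsigned long c = static_cast<unsigned long>(ws[i]);
        if (c < 0x80) {
            // ASCII: one byte.
            out += static_cast<char>(c);
        } else if (c < 0x800) {
            // Two bytes: 110xxxxx 10xxxxxx (all Cyrillic letters land here).
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
            out += static_cast<char>(0xE0 | (c >> 12));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        } else {
            // Four bytes; reachable only where wchar_t is wider than 16 bits.
            out += static_cast<char>(0xF0 | (c >> 18));
            out += static_cast<char>(0x80 | ((c >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((c >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```

The result can then be written with a plain ofstream, for example xml << to_utf8(fname); no locale is involved at any point.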
