C++ - How to print each character of strings that mix ASCII characters with Unicode?
For example, I want to create a typewriter effect, so I need to print strings like this:
    #include <cstdio>
    #include <string>

    int main(){
        std::string st1="ab》cd《ef";
        for(int i=0;i<st1.size();i++){
            // substr(0,i) counts bytes, not characters
            std::string st2=st1.substr(0,i).c_str();
            printf("%s\n",st2.c_str());
        }
        return 0;
    }

But the output is:
    a
    ab
    ab?
    ab?
    ab》
    ab》c
    ab》cd
    ab》cd?
    ab》cd?
    ab》cd《
    ab》cd《e

and not:
    a
    ab
    ab》
    ab》c
    ab》cd
    ab》cd《
    ab》cd《e

How can I know whether the upcoming character is Unicode?
Similarly, printing each character on its own has the same problem:
    #include <cstdio>
    #include <string>

    int main(){
        std::string st1="ab》cd《ef";
        for(int i=0;i<st1.size();i++){
            // substr(i,1) extracts a single byte, not a single character
            std::string st2=st1.substr(i,1).c_str();
            printf("%s\n",st2.c_str());
        }
        return 0;
    }

The output is:
    a
    b
    ?
    ?
    ?
    c
    d
    ?
    ?
    ?
    e
    f

and not:
    a
    b
    》
    c
    d
    《
    e
    f
I think the problem is the encoding. Your string is encoded in UTF-8, which has variable-sized characters. This means you cannot iterate one char at a time, because some characters are more than one char wide.
The fact is that, in Unicode, you can only reliably iterate one fixed-size character at a time with the UTF-32 encoding.
So you can use a Unicode library such as ICU to convert between UTF-8 and UTF-32.
If you have C++11, there are some tools to help you here, such as std::u32string, which is able to hold UTF-32 encoded strings:
    #include <string>
    #include <iostream>

    #include <unicode/ucnv.h>
    #include <unicode/uchar.h>
    #include <unicode/utypes.h>

    // convert from UTF-32 to UTF-8
    std::string to_utf8(std::u32string s)
    {
        UErrorCode status = U_ZERO_ERROR;
        char target[1024];
        int32_t len = ucnv_convert(
            "UTF-8", "UTF-32"
            , target, sizeof(target)
            , (const char*)s.data(), s.size() * sizeof(char32_t)
            , &status);
        return std::string(target, len);
    }

    // convert from UTF-8 to UTF-32
    std::u32string to_utf32(const std::string& utf8)
    {
        UErrorCode status = U_ZERO_ERROR;
        char32_t target[256];
        int32_t len = ucnv_convert(
            "UTF-32", "UTF-8"
            , (char*)target, sizeof(target)
            , utf8.data(), utf8.size()
            , &status);
        return std::u32string(target, (len / sizeof(char32_t)));
    }

    int main()
    {
        // UTF-8 input (needs a UTF-8 editor)
        std::string utf8 = "ab》cd《ef"; // UTF-8

        // convert to UTF-32
        std::u32string utf32 = to_utf32(utf8);

        // now it is safe to use string indexing,
        // but i is a length, so it starts from 1
        for(std::size_t i = 1; i < utf32.size(); ++i)
        {
            // convert back to UTF-8 for output
            // NOTE: i + 1 to include the BOM
            std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
        }
    }

Output:
    a
    ab
    ab》
    ab》c
    ab》cd
    ab》cd《
    ab》cd《e
    ab》cd《ef

Note:
The ICU library adds a BOM (byte order mark) at the beginning of the strings it converts to Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is a BOM. This is why the substring uses i + 1 as its length parameter: to include the BOM.