C++ - How to print each character of strings that mix ASCII characters with Unicode?

April 15, 2015

For example, I want to create a typewriter effect, so I need to print strings like this:

    #include <cstdio>
    #include <string>

    int main(){
        std::string st1="ab》cd《ef";
        for(int i=0;i<st1.size();i++){
            std::string st2=st1.substr(0,i).c_str();
            printf("%s\n",st2.c_str());
        }
        return 0;
    }

But the output is:

    a
    ab
    ab?
    ab?
    ab》
    ab》c
    ab》cd
    ab》cd?
    ab》cd?
    ab》cd《
    ab》cd《e

and not:

    a
    ab
    ab》
    ab》c
    ab》cd
    ab》cd《
    ab》cd《e

How can I know whether the upcoming character is a Unicode (non-ASCII) character?

A similar question: printing each character on its own has the same problem:

    #include <cstdio>
    #include <string>

    int main(){
        std::string st1="ab》cd《ef";
        for(int i=0;i<st1.size();i++){
            std::string st2=st1.substr(i,1).c_str();
            printf("%s\n",st2.c_str());
        }
        return 0;
    }

the output is:

    a
    b
    ?
    ?
    ?
    c
    d
    ?
    ?
    ?
    e
    f

not:

    a
    b
    》
    c
    d
    《
    e
    f

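I think the problem is the encoding. A string in UTF-8 encoding has variable-sized characters. That means you cannot iterate one char at a time, because some characters are more than one char wide: 》 and 《 each occupy three bytes, which is why three "?" appear for each of them.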
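To tell how wide the upcoming character is, it is enough to look at its first byte, because UTF-8 encodes the sequence length in the leading bits. The following is a minimal sketch using only the standard library, assuming the input is well-formed UTF-8. The helper names `utf8_char_length` and `utf8_prefixes` are made up for this illustration and do not come from any library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence that starts with `lead`.
// Assumes valid UTF-8; an unexpected byte is treated as 1 byte wide.
std::size_t utf8_char_length(unsigned char lead)
{
    if (lead < 0x80)           return 1; // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx (e.g. 》 and 《)
    if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
    return 1;                            // continuation or invalid byte
}

// Every prefix of `s` that ends on a character boundary,
// i.e. the successive frames of a typewriter effect.
std::vector<std::string> utf8_prefixes(const std::string& s)
{
    std::vector<std::string> frames;
    std::size_t pos = 0;
    while (pos < s.size()) {
        pos += utf8_char_length(static_cast<unsigned char>(s[pos]));
        if (pos > s.size()) pos = s.size(); // truncated final sequence
        frames.push_back(s.substr(0, pos));
    }
    return frames;
}
```

Printing each element of `utf8_prefixes(st1)` on its own line should give the typewriter frames without any "?" entries, provided the source file and the terminal both use UTF-8. Note that this splits on code points, so a character built from several code points (a base letter plus a combining mark, for instance) would still be split.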

The fact is, in Unicode, you can only iterate reliably one fixed-size character (code point) at a time with the UTF-32 encoding.

So what you can do is use a UTF library such as ICU to convert between UTF-8 and UTF-32.

If you have C++11, there are some tools to help you here, mainly std::u32string, which is able to hold UTF-32 encoded strings:
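If pulling in ICU is not an option, the UTF-8 to UTF-32 direction can also be sketched by hand with nothing but std::u32string. This is a hand-rolled decoder, not ICU, and it assumes well-formed input: it does not validate continuation bytes, overlong forms or surrogates. The name `decode_utf8` is hypothetical.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points. Assumes well-formed input:
// no validation of continuation bytes, overlong forms or surrogates.
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        std::size_t extra;       // number of continuation bytes to read
        char32_t cp;             // code point being assembled
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else                         { cp = 0xFFFD;   extra = 0; } // stray byte
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (; extra > 0 && i < in.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

With the result in a std::u32string, each index holds exactly one code point, so indexing and substr behave the way the question expected. Unlike the ICU conversion, this sketch does not prepend a BOM. For untrusted input a validating library is the safer choice.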

    #include <string>
    #include <iostream>

    #include <unicode/ucnv.h>
    #include <unicode/uchar.h>
    #include <unicode/utypes.h>

    // convert from UTF-32 to UTF-8
    std::string to_utf8(std::u32string s)
    {
        UErrorCode status = U_ZERO_ERROR;
        char target[1024];
        int32_t len = ucnv_convert(
            "UTF-8", "UTF-32"
            , target, sizeof(target)
            , (const char*)s.data(), s.size() * sizeof(char32_t)
            , &status);
        return std::string(target, len);
    }

    // convert from UTF-8 to UTF-32
    std::u32string to_utf32(const std::string& utf8)
    {
        UErrorCode status = U_ZERO_ERROR;
        char32_t target[256];
        int32_t len = ucnv_convert(
            "UTF-32", "UTF-8"
            , (char*)target, sizeof(target)
            , utf8.data(), utf8.size()
            , &status);
        return std::u32string(target, (len / sizeof(char32_t)));
    }

    int main()
    {
        // UTF-8 input (needs a UTF-8 editor)
        std::string utf8 = "ab》cd《ef"; // UTF-8

        // convert to UTF-32
        std::u32string utf32 = to_utf32(utf8);

        // now it is safe to use string indexing
        // i is the number of visible characters, starting from 1
        for(std::size_t i = 1; i < utf32.size(); ++i)
        {
            // convert back to UTF-8 for output
            // note: i + 1 to include the BOM
            std::cout << to_utf8(utf32.substr(0, i + 1)) << '\n';
        }
    }

output:

    a
    ab
    ab》
    ab》c
    ab》cd
    ab》cd《
    ab》cd《e
    ab》cd《ef

note:

The ICU library adds a BOM (byte order mark) at the beginning of the strings it converts to Unicode. Therefore you need to deal with the fact that the first character of the UTF-32 string is the BOM. This is why the substring uses i + 1 for its length parameter: to include the BOM.
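If the goal is only to index into the converted string, another way to deal with the BOM is to drop it from a copy so that index 0 is the first real character. A minimal sketch, assuming the BOM is stored as the code point U+FEFF at index 0 as described above; `strip_bom` is a hypothetical helper, not an ICU function.

```cpp
#include <string>

// Return `s` without a leading byte order mark (U+FEFF), if present.
std::u32string strip_bom(std::u32string s)
{
    if (!s.empty() && s.front() == U'\uFEFF') {
        s.erase(0, 1);
    }
    return s;
}
```

A converter opened under the plain name "UTF-32" may rely on the BOM to detect byte order, so this is appropriate for the copy you index into and not necessarily for data handed back to ICU for conversion.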
