Convert from UTF-8 to ISO8859-15 in C++













I would like to do a conversion from UTF-8 to ISO 8859-15 in C/C++, without including an additional library.



How can I achieve this?



I have found the following piece of code, which works for ISO 8859-1, but I'm not sure how to handle the differences between ISO 8859-15 and ISO 8859-1 (https://en.wikipedia.org/wiki/ISO/IEC_8859-15):



std::string UTF8toISO8859_1(const char * in) {
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint;
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            if (codepoint <= 255) {
                out.append(1, static_cast<char>(codepoint));
            }
            else {
                out.append("?");
            }
        }
    }
    return out;
}













Maybe this can help: "Comparing ISO-8859-1 and ISO-8859-15"?
– Robert Andrzejuk
Nov 12 '18 at 20:58
















c++ string encoding utf-8 iso-8859-15
















asked Nov 12 '18 at 20:14









Kamchatka




















1 Answer














I like this code. It's surprisingly short. Most of the code just deals with decoding multi-byte sequences into codepoints. Once a codepoint has been decoded, the conversion to ISO-8859-1 is very simple:




  • If it is less than or equal to 255, it is also a valid ISO-8859-1 character: out.append(1, static_cast<char>(codepoint));

  • If not, it cannot be represented in ISO-8859-1 and is replaced with a question mark: out.append("?");


To make it work for ISO-8859-15, more code is needed to handle the eight characters that were replaced when ISO-8859-15 was introduced (see "Comparing ISO-8859-1 and ISO-8859-15"). Unfortunately, this considerably increases the code size.
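For reference, the eight positions that differ can be written down as a small lookup table. This is only a sketch: the names Latin9Entry, kLatin9Table and latin9FromCodepoint are made up for illustration and do not appear in the question's code.

```cpp
#include <cstddef>

// Hypothetical helper type: one entry per position where ISO-8859-15
// differs from ISO-8859-1.
struct Latin9Entry {
    unsigned int codepoint;  // Unicode code point of the new character
    unsigned char byte;      // ISO-8859-15 byte that encodes it
};

static const Latin9Entry kLatin9Table[] = {
    {0x20AC, 0xA4},  // EURO SIGN
    {0x0160, 0xA6},  // LATIN CAPITAL LETTER S WITH CARON
    {0x0161, 0xA8},  // LATIN SMALL LETTER S WITH CARON
    {0x017D, 0xB4},  // LATIN CAPITAL LETTER Z WITH CARON
    {0x017E, 0xB8},  // LATIN SMALL LETTER Z WITH CARON
    {0x0152, 0xBC},  // LATIN CAPITAL LIGATURE OE
    {0x0153, 0xBD},  // LATIN SMALL LIGATURE OE
    {0x0178, 0xBE},  // LATIN CAPITAL LETTER Y WITH DIAERESIS
};

// Returns the ISO-8859-15 byte for a code point, or '?' if there is none.
unsigned char latin9FromCodepoint(unsigned int cp) {
    const std::size_t n = sizeof(kLatin9Table) / sizeof(kLatin9Table[0]);
    for (std::size_t i = 0; i < n; ++i) {
        if (kLatin9Table[i].codepoint == cp)
            return kLatin9Table[i].byte;  // one of the eight new characters
        if (kLatin9Table[i].byte == cp)
            return '?';                   // Latin-1 character that lost its slot
    }
    // Everything else up to 0xFF is identical in both encodings.
    return cp <= 0xFF ? static_cast<unsigned char>(cp) : '?';
}
```

Each table row covers both directions of the change: the new character maps to the byte, and the ISO-8859-1 character that used to live at that byte becomes unrepresentable.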



The code below is meant to be easy to understand. It can be optimized if performance is a main concern.



#include <string>

std::string UTF8toISO8859_15(const char * in) {
    std::string out;
    if (in == NULL)
        return out;

    unsigned int codepoint = 0;  // initialized in case the input starts with a continuation byte
    while (*in != 0) {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;
        else
            codepoint = ch & 0x07;
        ++in;

        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff)) {
            // a complete codepoint has been decoded; convert it to ISO-8859-15
            char outc;
            if (codepoint <= 255) {
                // codepoints up to 255 can be converted directly, with a few exceptions
                if (codepoint != 0xa4 && codepoint != 0xa6 && codepoint != 0xa8
                    && codepoint != 0xb4 && codepoint != 0xb8 && codepoint != 0xbc
                    && codepoint != 0xbd && codepoint != 0xbe) {
                    outc = static_cast<char>(codepoint);
                }
                else {
                    outc = '?';
                }
            }
            else {
                // with a few exceptions, codepoints above 255 cannot be converted
                if (codepoint == 0x20AC) {
                    outc = static_cast<char>(0xa4);  // euro sign
                }
                else if (codepoint == 0x0160) {
                    outc = static_cast<char>(0xa6);  // S with caron
                }
                else if (codepoint == 0x0161) {
                    outc = static_cast<char>(0xa8);  // s with caron
                }
                else if (codepoint == 0x017d) {
                    outc = static_cast<char>(0xb4);  // Z with caron
                }
                else if (codepoint == 0x017e) {
                    outc = static_cast<char>(0xb8);  // z with caron
                }
                else if (codepoint == 0x0152) {
                    outc = static_cast<char>(0xbc);  // OE ligature
                }
                else if (codepoint == 0x0153) {
                    outc = static_cast<char>(0xbd);  // oe ligature
                }
                else if (codepoint == 0x0178) {
                    outc = static_cast<char>(0xbe);  // Y with diaeresis
                }
                else {
                    outc = '?';
                }
            }
            out.append(1, outc);
        }
    }
    return out;
}
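If the chain of comparisons feels long, the mapping can also be factored into a switch. The following is a sketch of that idea, not a drop-in replacement: the names toLatin9 and utf8ToLatin9 are hypothetical, and unlike the function above it takes a std::string (an assumption made here so the loop never reads past the end of the input). Like the original, it does not reject malformed or overlong UTF-8.

```cpp
#include <string>

// Hypothetical helper: maps one decoded code point to its ISO-8859-15 byte,
// or '?' when the character has no representation.
static char toLatin9(unsigned int cp) {
    switch (cp) {
        case 0x20AC: return static_cast<char>(0xA4);  // euro sign
        case 0x0160: return static_cast<char>(0xA6);  // S with caron
        case 0x0161: return static_cast<char>(0xA8);  // s with caron
        case 0x017D: return static_cast<char>(0xB4);  // Z with caron
        case 0x017E: return static_cast<char>(0xB8);  // z with caron
        case 0x0152: return static_cast<char>(0xBC);  // OE ligature
        case 0x0153: return static_cast<char>(0xBD);  // oe ligature
        case 0x0178: return static_cast<char>(0xBE);  // Y with diaeresis
        // Latin-1 characters whose slots were reassigned in ISO-8859-15
        case 0xA4: case 0xA6: case 0xA8: case 0xB4:
        case 0xB8: case 0xBC: case 0xBD: case 0xBE:
            return '?';
        default:
            return cp <= 0xFF ? static_cast<char>(cp) : '?';
    }
}

// Same decoding approach as the function above, with the mapping factored out.
std::string utf8ToLatin9(const std::string& in) {
    std::string out;
    unsigned int cp = 0;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        unsigned char ch = static_cast<unsigned char>(in[i]);
        if (ch <= 0x7F)      cp = ch;                       // ASCII
        else if (ch <= 0xBF) cp = (cp << 6) | (ch & 0x3F);  // continuation byte
        else if (ch <= 0xDF) cp = ch & 0x1F;                // lead byte, 2-byte sequence
        else if (ch <= 0xEF) cp = ch & 0x0F;                // lead byte, 3-byte sequence
        else                 cp = ch & 0x07;                // lead byte, 4-byte sequence
        // The sequence is complete when the next byte is not a continuation byte.
        bool last = (i + 1 == in.size()) ||
                    ((static_cast<unsigned char>(in[i + 1]) & 0xC0) != 0x80);
        if (last && cp <= 0x10FFFF)
            out.push_back(toLatin9(cp));
    }
    return out;
}
```

For example, utf8ToLatin9("\xE2\x82\xAC") should yield the single byte 0xA4 (the euro sign), while the UTF-8 encoding of the currency sign, "\xC2\xA4", should come out as "?".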





answered Nov 12 '18 at 21:55

Codo





























