How to specify the endianness of UTF-16 string literals in C++17 with Clang?












UTF-16 string literals, such as auto str = u"中国字";, are allowed in modern C++ source code.



UTF-16 comes in two byte orders: UTF-16LE and UTF-16BE. The C++ standard doesn't specify the endianness of UTF-16 string literals, so I think it is implementation-defined.



Is there any way to specify the endianness at compile time?







c++ unicode clang standards c++17






asked Nov 15 '18 at 2:20









xmllmx














  • That's really one of the main reasons you should not use UTF-16 (or UTF-32) if you want to transfer strings between programs or systems. Use UTF-8 instead. Internally, inside your program, use whatever encoding you want, but not when saving to a file or transferring over a network.

    – Some programmer dude
    Nov 15 '18 at 2:25













  • UTF-8 has its disadvantages: it's hard to sort and search. So, in some cases, UTF-16 is preferred.

    – xmllmx
    Nov 15 '18 at 2:27













  • As I modified my comment to say, you can use it internally inside your program (as long as you're aware that it will not represent all of Unicode and is a variable-length encoding). Externally, outside the program, use UTF-8.

    – Some programmer dude
    Nov 15 '18 at 2:29











  • The short answer is "no". Unicode string literals use the natural endianness of the implementation.

    – Sam Varshavchik
    Nov 15 '18 at 2:30











  • @SamVarshavchik, my real issue is this: if I have many UTF-16LE strings loaded from the network, and the local natural endianness is big-endian, then I must convert them dynamically, which is time-consuming, rather than just statically specifying the endianness.

    – xmllmx
    Nov 15 '18 at 2:33
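
A note on that runtime conversion: because each UTF-16 code unit is only two bytes, it can be assembled arithmetically from the incoming bytes, which yields the correct native value on both little-endian and big-endian hosts. A minimal C++17 sketch (the function name and container choice are illustrative, not taken from this thread):

#include <cstddef>
#include <string>
#include <vector>

// Sketch: decode a UTF-16LE byte stream (e.g. received from the network)
// into a native-endian std::u16string. Assembling each code unit from its
// two bytes makes the loop independent of the host's own byte order.
std::u16string utf16le_to_native(const std::vector<unsigned char>& bytes)
{
    std::u16string out;
    out.reserve(bytes.size() / 2);
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        const unsigned lo = bytes[i];      // UTF-16LE stores the low byte first
        const unsigned hi = bytes[i + 1];
        out.push_back(static_cast<char16_t>(lo | (hi << 8)));
    }
    return out;
}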





















1 Answer
































A string literal prefixed with u is an array of const char16_t values:



C++17 [lex.string]/10:




A string-literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it is initialized with the given characters.




So the literal in the quote is equivalent to, on a Unicode system:



const char16_t x[] = { 97, 115, 100, 102, 0 };


In other words, the representation of the string literal is the same as the representation of that array.



For a more complicated string, it is still an array of const char16_t; and there may be multiple char16_t code units per c-char (a surrogate pair), i.e. the number of elements in the array might be more than the number of characters that seem to appear in the string.
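
Both points can be verified at compile time. A small illustration (not part of the original answer; it assumes a C++17 compiler, and the emoji is just an arbitrary example of a character outside the BMP):

// u"中国字" holds three BMP characters: 3 code units plus the terminating null.
static_assert(sizeof(u"中国字") / sizeof(char16_t) == 4, "3 code units + null");

// A character outside the BMP (U+1F600) is encoded as a surrogate pair,
// so a single c-char contributes two char16_t elements: 2 code units + null.
static_assert(sizeof(u"\U0001F600") / sizeof(char16_t) == 3, "surrogate pair + null");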





To answer the question in the title: I'm not aware of any compiler option (for any compiler) that would let you configure the endianness of char16_t. I would expect any target system to use the same endianness for all the integral types. char16_t is supposed to have the same properties as uint_least16_t ([basic.fundamental]/5).



If your code contains string literals and you want to write them into a file specifically as UTF-16BE, for example, you'll need to do the usual endianness checks/adjustments in case your system stores char16_t in little-endian form.
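
For that last step, here is a minimal C++17 sketch (the helper name write_utf16be is hypothetical, not from the answer; C++17 has no std::endian, so the byte order is probed at run time with memcpy, and char16_t is assumed to be exactly 16 bits, as it is on common platforms):

#include <cstddef>
#include <cstdio>
#include <cstring>

// Sketch: write n char16_t code units to a file as UTF-16BE, byte-swapping
// only when the host stores char16_t in little-endian order.
// Assumes char16_t is exactly 16 bits (true on common platforms).
void write_utf16be(std::FILE* f, const char16_t* s, std::size_t n)
{
    const char16_t probe = 0x0102;
    unsigned char first_byte;
    std::memcpy(&first_byte, &probe, 1);               // 0x02 on a little-endian host
    const bool host_is_little_endian = (first_byte == 0x02);

    for (std::size_t i = 0; i < n; ++i) {
        char16_t unit = s[i];
        if (host_is_little_endian)
            unit = static_cast<char16_t>((unit >> 8) | (unit << 8));  // swap the two bytes
        std::fwrite(&unit, sizeof unit, 1, f);
    }
}

Alternatively, emitting the two bytes of each code unit explicitly (high byte first) avoids the endianness probe altogether; the check-and-swap form above is shown only because it mirrors the "checks/adjustments" described here.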






        edited Nov 15 '18 at 2:47

























        answered Nov 15 '18 at 2:42









M.M
