How to specify the endianness of utf-16 string literals in C++17 with Clang?












UTF-16 string literals, such as auto str = u"中国字";, are allowed in modern C++ source code.



UTF-16 has two endiannesses: UTF-16LE and UTF-16BE. The C++ standard doesn't specify the endianness of UTF-16 string literals, so I assume it is implementation-defined.

Is there any way to specify the endianness at compile time?










c++ unicode clang standards c++17

asked Nov 15 '18 at 2:20 by xmllmx























  • That's really one of the main reasons you should not use UTF-16 (or UTF-32) if you want to transfer the strings between programs or systems. Use UTF-8 instead. Internally inside your program use whatever encoding you want, but not when saving to a file or when transferring over a network.

    – Some programmer dude
    Nov 15 '18 at 2:25













  • UTF-8 has its disadvantages: It's hard to sort and search. So, in some cases, UTF-16 is preferred.

    – xmllmx
    Nov 15 '18 at 2:27













  • As I modified my comment to say, you can use it internally inside your program (as long as you're aware that it will not represent all of Unicode in single code units and is a variable-length encoding). Outside the program, use UTF-8.

    – Some programmer dude
    Nov 15 '18 at 2:29











  • The short answer is "no". Unicode string literals use the natural endian-ness of the implementation.

    – Sam Varshavchik
    Nov 15 '18 at 2:30











  • @SamVarshavchik, my real issue is: if I have many UTF-16LE strings loaded from the network and the local natural endianness is big-endian (UTF-16BE), then I must convert them dynamically, which is time-consuming, rather than just specifying the endianness statically.

    – xmllmx
    Nov 15 '18 at 2:33


















1 Answer
A string literal prefixed with u is an array of const char16_t values:



C++17 [lex.string]/10:




A string-literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type “array of n const char16_t”, where n is the size of the string as defined below; it is initialized with the given characters.




So, on a Unicode system, the literal in the quote is equivalent to:



const char16_t x[] = { 97, 115, 100, 102, 0 };


In other words, the representation of the string literal is the same as the representation of that array.
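A quick way to see this (a sketch of my own, not from the answer; the names lit and arr are mine):

// The literal and the explicit array occupy the same number of bytes;
// on such a system their object representations compare equal as well
// (e.g. std::memcmp(lit, arr, sizeof lit) == 0).
const char16_t lit[] = u"asdf";
const char16_t arr[] = { 97, 115, 100, 102, 0 };

static_assert(sizeof lit == sizeof arr, "both hold five char16_t elements");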



For a more complicated string, it is still an array of const char16_t, and there may be multiple code units per c-char, i.e. the number of elements in the array might be more than the number of characters that seem to appear in the string.





To answer the question in the title: I'm not aware of any compiler option (for any compiler) that would let you configure the endianness of char16_t. I would expect any target system to use the same endianness for all the integral types. char16_t is supposed to have the same properties as uint_least16_t ([basic.fundamental]/5).
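As an illustration (a minimal sketch of my own, assuming a C++17 toolchain; host_is_little_endian is a made-up name), the host byte order can be probed at run time, since std::endian only arrives in C++20:

#include <cstdint>
#include <cstring>

// Returns true if the host stores the low-order byte first.
// C++17 has no std::endian, so inspect the object representation directly.
bool host_is_little_endian()
{
    const std::uint16_t probe = 0x0102;
    unsigned char first_byte = 0;
    std::memcpy(&first_byte, &probe, 1);
    return first_byte == 0x02; // low byte stored first => little-endian
}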



If your code contains string literals and you want to write them to a file as specifically UTF-16BE, for example, you'll need to do the usual endian checks and adjustments in case your system stores char16_t in little-endian form.
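For example, here is a hedged sketch of such an adjustment (utf16be_bytes is a hypothetical helper of mine, not a standard function). Because it builds each byte with shifts on the value rather than reinterpreting memory, it produces UTF-16BE output regardless of the host's byte order:

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: serialize n char16_t code units as UTF-16BE bytes.
// Shifting extracts the high and low bytes by value, so the result is
// correct on both little-endian and big-endian hosts.
std::vector<unsigned char> utf16be_bytes(const char16_t* s, std::size_t n)
{
    std::vector<unsigned char> out;
    out.reserve(n * 2);
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint16_t u = static_cast<std::uint16_t>(s[i]);
        out.push_back(static_cast<unsigned char>(u >> 8));   // high byte first
        out.push_back(static_cast<unsigned char>(u & 0xFF)); // then low byte
    }
    return out;
}

// Usage, e.g. for the question's literal (three BMP characters, three code units):
// const auto bytes = utf16be_bytes(u"中国字", 3);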






edited Nov 15 '18 at 2:47, answered Nov 15 '18 at 2:42 by M.M
























