How to specify the endianness of UTF-16 string literals in C++17 with Clang?
UTF-16 string literals, such as auto str = u"中国字";, are allowed in modern C++ source code.
UTF-16 has two byte orders: UTF-16LE and UTF-16BE. The C++ standard doesn't specify the endianness of UTF-16 string literals, so I assume it is implementation-defined.
Is there any way to specify the endianness at compile time?
c++ unicode clang standards c++17
asked Nov 15 '18 at 2:20
xmllmx
13.9k 9 86 211
That's really one of the main reasons you should not use UTF-16 (or UTF-32) if you want to transfer the strings between programs or systems. Use UTF-8 instead. Internally inside your program use whatever encoding you want, but not when saving to a file or when transferring over a network.
– Some programmer dude
Nov 15 '18 at 2:25
UTF-8 has its disadvantages: It's hard to sort and search. So, in some cases, UTF-16 is preferred.
– xmllmx
Nov 15 '18 at 2:27
As I modified my comment to say, you can use it internally inside your program (as long as you're aware that it will not represent all of Unicode in a single code unit and is a variable-length encoding). Externally, outside the program, use UTF-8.
– Some programmer dude
Nov 15 '18 at 2:29
The short answer is "no". Unicode string literals use the natural endianness of the implementation.
– Sam Varshavchik
Nov 15 '18 at 2:30
@SamVarshavchik, My real issue is: if I have many UTF-16LE strings loaded from the network, and the local natural byte order is big-endian, then I must convert them dynamically, which is time-consuming, rather than just statically specifying the endianness.
– xmllmx
Nov 15 '18 at 2:33
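For reference, a minimal sketch of the kind of dynamic conversion described in the comment above (the helper name swap_utf16_units is hypothetical; it would only be called on hosts whose byte order differs from the wire data):

#include <cstddef>
#include <cstdint>

// Swap the bytes of each UTF-16 code unit in place, e.g. to turn
// UTF-16LE wire data into host-order char16_t on a big-endian host.
// Shifts operate on values, not memory layout, so this compiles to
// the same correct swap on any host.
void swap_utf16_units(char16_t* data, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint16_t u = static_cast<std::uint16_t>(data[i]);
        data[i] = static_cast<char16_t>((u << 8) | (u >> 8));
    }
}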
1 Answer
A string literal prefixed with u is an array of const char16_t values:
C++17 [lex.string]/10:
A string-literal that begins with u, such as u"asdf", is a char16_t string literal. A char16_t string literal has type "array of n const char16_t", where n is the size of the string as defined below; it is initialized with the given characters.
So the literal in the quote is equivalent to, on a Unicode system:
const char16_t x[] = { 97, 115, 100, 102, 0 };
In other words, the representation of the string literal is the same as the representation of that array.
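A quick way to see this (a minimal sketch; it assumes the literal encoding maps a, s, d, f to those code points, as it does on any Unicode system):

#include <cstdio>
#include <cstring>

int main() {
    const char16_t lit[] = u"asdf";
    const char16_t arr[] = { 97, 115, 100, 102, 0 };
    // Both objects occupy sizeof(char16_t) * 5 bytes and should be
    // byte-for-byte identical, whatever the host byte order is.
    std::puts(std::memcmp(lit, arr, sizeof lit) == 0 ? "identical" : "different");
}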
For a more complicated string, it is still an array of const char16_t; and there may be multiple code units per c-char (characters outside the Basic Multilingual Plane are encoded as surrogate pairs), i.e. the number of elements in the array might be more than the number of characters that seem to appear in the string.
To answer the question in the title: I'm not aware of any compiler option (for any compiler) that would let you configure the endianness of char16_t. I would expect any target system to use the same endianness for all the integral types. char16_t is supposed to have the same properties as uint_least16_t ([basic.fundamental]/5).
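These guarantees can be spot-checked at compile time (a minimal sketch):

#include <cstdint>

// [basic.fundamental]/5: char16_t has the same size, signedness,
// and alignment as uint_least16_t.
static_assert(sizeof(char16_t) == sizeof(std::uint_least16_t),
              "char16_t must match uint_least16_t in size");
static_assert(alignof(char16_t) == alignof(std::uint_least16_t),
              "char16_t must match uint_least16_t in alignment");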
If your code contains string literals and you want to write them into a file as specifically UTF-16BE, for example, you'll need to do the usual endian checks and adjustments in case your system stores char16_t in little-endian form.
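For example, here is a minimal sketch of such an adjustment (the helper name write_utf16be is hypothetical). Composing each output byte with shifts extracts the high byte by value rather than by memory layout, so it produces UTF-16BE regardless of the host byte order:

#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Serialize n char16_t code units as UTF-16BE bytes.
std::vector<unsigned char> write_utf16be(const char16_t* s, std::size_t n) {
    std::vector<unsigned char> out;
    out.reserve(n * 2);
    for (std::size_t i = 0; i < n; ++i) {
        const std::uint16_t u = static_cast<std::uint16_t>(s[i]);
        out.push_back(static_cast<unsigned char>(u >> 8));   // high byte first
        out.push_back(static_cast<unsigned char>(u & 0xFF)); // low byte second
    }
    return out;
}

int main() {
    const char16_t str[] = u"asdf";
    for (unsigned char b : write_utf16be(str, 4))
        std::printf("%02X ", b);  // prints: 00 61 00 73 00 64 00 66
    std::printf("\n");
}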
edited Nov 15 '18 at 2:47
answered Nov 15 '18 at 2:42
M.M
106k 11 119 240