C/C++: relaxed std::atomic vs unlocked bool on X64 architecture
Is there any efficiency benefit to using an unlocked boolean over using an std::atomic&lt;bool&gt; where the operations are always done with relaxed memory order? I would assume that both eventually compile to the same machine code, since a single byte is actually atomic on X64 hardware. Am I wrong?
c++ performance synchronization x86-64 atomic
"since a single byte is actually atomic in hardware" - that's not a given fact.
– Jesper Juhl
Nov 11 at 18:30
Not even on X64 architecture? (Note what I wrote in the title)
– tohava
Nov 11 at 18:32
@JesperJuhl: I doubt there are any architectures where a byte load or store isn't atomic. (Except rare ISAs like early DEC Alpha that don't have byte load/store instructions, only word. Or word-addressable DSPs. But on them, bool would be a word wide, not a byte.)
– Peter Cordes
Nov 11 at 19:21
asked Nov 11 at 18:10 by tohava, edited Nov 11 at 18:32
2 Answers
Accepted answer (score 4) – answered Nov 11 at 19:19 by Peter Cordes, edited Nov 12 at 18:47
Yes, there are potentially massive advantages, especially for local variables, or any variable used repeatedly in the same function. An atomic&lt;&gt; variable can't be optimized into a register.

If you compiled without optimization, the code-gen would be similar, but compiling with normal optimization enabled there can be massive differences. Un-optimized code is similar to making every variable volatile.

Current compilers also never combine multiple reads of an atomic variable into one, as if you'd used volatile atomic&lt;T&gt;, because that's what people expect and the dust hasn't settled yet on how to allow useful optimizations while prohibiting ones you don't want. (See Why don't compilers merge redundant std::atomic writes? and Can and does the compiler optimize out two atomic loads?)

This isn't a great example, but imagine that checking the boolean is done inside an inlined function, and that there's something else inside the loop. (Otherwise you'd put the if around the loop like a normal person.)
#include &lt;atomic&gt;

std::atomic&lt;bool&gt; atomic_bool;   // global flag

int sumarr_atomic(const int *arr) {
    int sum = 0;
    for(int i=0 ; i&lt;10000 ; i++) {
        if (atomic_bool.load(std::memory_order_relaxed)) {
            sum += arr[i];
        }
    }
    return sum;
}
See the asm output on Godbolt.

But with a non-atomic bool, the compiler can make that transformation for you by hoisting the load, and then auto-vectorize the simple sum loop (or not run it at all). With atomic_bool, it can't: the asm loop is much like the C++ source, actually doing a test and branch on the value of the variable inside every loop iteration, which of course defeats auto-vectorization.

(The C++ as-if rules would allow the compiler to hoist the load, because it's relaxed so it can reorder with non-atomic accesses, and to merge the loads, because reading the same value every time is one possible result of a global order that reads one value. But as I said, compilers don't do that.)

Loops over an array of bool can auto-vectorize, but not over atomic&lt;bool&gt;.
Also, inverting a boolean with something like b ^= 1; or b++ can be just a regular RMW, not an atomic RMW, so it doesn't have to use lock xor or lock btc. (x86 atomic RMW is only possible with sequential consistency vs. runtime reordering, i.e. the lock prefix is also a full memory barrier.)
Code that modifies a non-atomic boolean can optimize away the actual modifications, e.g.
bool regular_bool;   // plain non-atomic global

void loop() {
    for(int i=0 ; i&lt;10000 ; i++) {
        regular_bool ^= 1;
    }
}
compiles to asm that keeps regular_bool in a register. Unfortunately it doesn't optimize away to nothing (which it could, because flipping a boolean an even number of times sets it back to its original value), but a smarter compiler could do that.
loop():
movzx edx, BYTE PTR regular_bool[rip] # load into a register
mov eax, 10000
.L17: # do {
xor edx, 1 # flip the boolean
sub eax, 1
jne .L17 # } while(--i);
mov BYTE PTR regular_bool[rip], dl # store back the result
ret
Even if written as atomic_b.store(!atomic_b.load(mo_relaxed), mo_relaxed) (separate atomic load and store), you'd still get a store/reload in the loop, creating a 6-cycle loop-carried dependency chain through the store/reload (on Intel CPUs with 5-cycle store-forwarding latency) instead of a 1-cycle dep chain through a register.
Answer (score 1) – answered Nov 11 at 18:36 by Paul Sanders, edited Nov 11 at 18:39
Checking over at Godbolt, loading a regular bool and a std::atomic&lt;bool&gt; generate different code, although not because of synchronisation issues. Instead, the compiler (gcc) seems unwilling to assume that a std::atomic&lt;bool&gt; is guaranteed to be either 0 or 1. Strange, that.

Clang does the same thing, although the code generated is slightly different in detail.
Using cout &lt;&lt; clutters the code a lot. godbolt.org/z/hFEQ5f is easier to read with separate functions that return the value of the global, like bool load_regular() { return regular_bool; } that compiles to a single movzx. (And the atomic version still booleanizes for no apparent reason.)
– Peter Cordes
Nov 11 at 18:39
@Peter I did it that way to stop the compiler optimising out the loads. Although I see from your example that moving the load into a separate function generates better code.
– Paul Sanders
Nov 11 at 18:40
Yeah I know, and my point is that returning a value from a function instead of writing a main solves the same problem much more cleanly. See How to remove "noise" from GCC/clang assembly output?. Remember you're just writing code so you can look at the asm, not run it.
– Peter Cordes
Nov 11 at 18:42
@Peter Ah, I see you never bother to call the functions so that gcc cannot inline them or optimise them away. A useful trick that.
– Paul Sanders
Nov 11 at 18:44
Even if you did write callers, you can still look at the stand-alone definition as well, if you don't make them static or inline.
– Peter Cordes
Nov 11 at 18:47