C/C++: relaxed std::atomic vs unlocked bool on X64 architecture
Is there any efficiency benefit to using an unlocked boolean over using an std::atomic&lt;bool&gt; where the operations are always done with relaxed memory order? I would assume that both eventually compile to the same machine code, since a single byte is actually atomic on X64 hardware. Am I wrong?
c++ performance synchronization x86-64 atomic
"since a single byte is actually atomic in hardware" - that's not a given fact.
– Jesper Juhl
Nov 11 at 18:30
Not even on X64 architecture? (Note what I wrote in the title)
– tohava
Nov 11 at 18:32
@JesperJuhl: I doubt there are any architectures where a byte load or store isn't atomic. (Except rare ISAs like early DEC Alpha that don't have byte load/store instructions, only word. Or word-addressable DSPs. But on them, bool would be a word wide, not a byte.)
– Peter Cordes
Nov 11 at 19:21
asked Nov 11 at 18:10 by tohava, edited Nov 11 at 18:32
2 Answers
Accepted answer (score 4) – answered Nov 11 at 19:19 by Peter Cordes, edited Nov 12 at 18:47
Yes, there are potentially massive advantages, especially for local variables, or any variable used repeatedly in the same function. An atomic&lt;&gt; variable can't be optimized into a register.

If you compiled without optimization, the code-gen would be similar, but compiling with normal optimization enabled there can be massive differences. Un-optimized code is similar to making every variable volatile.

Current compilers also never combine multiple reads of an atomic variable into one, as if you'd used volatile atomic&lt;T&gt;, because that's what people expect and the dust hasn't settled yet on how to allow useful optimizations while prohibiting ones you don't want. (See Why don't compilers merge redundant std::atomic writes? and Can and does the compiler optimize out two atomic loads?)

This isn't a great example, but imagine that checking the boolean is done inside an inlined function, and that there's something else inside the loop. (Otherwise you'd put the if around the loop like a normal person.)
#include &lt;atomic&gt;

std::atomic&lt;bool&gt; atomic_bool;   // global flag

int sumarr_atomic(const int *arr) {
    int sum = 0;
    for(int i=0 ; i&lt;10000 ; i++) {
        if (atomic_bool.load(std::memory_order_relaxed)) {
            sum += arr[i];
        }
    }
    return sum;
}
See the asm output on Godbolt.

But with a non-atomic bool, the compiler can make that transformation for you by hoisting the load, and then auto-vectorize the simple sum loop (or not run it at all). With atomic_bool, it can't: the asm loop is much like the C++ source, actually doing a test and branch on the value of the variable inside every loop iteration, which of course defeats auto-vectorization.

(The C++ as-if rules would allow the compiler to hoist the load, because it's relaxed so it can reorder with non-atomic accesses, and to merge the loads, because reading the same value every time is one possible result of a global order that reads one value. But as I said, compilers don't do that.)

Loops over an array of bool can auto-vectorize, but not over atomic&lt;bool&gt;.
Also, inverting a boolean with something like b ^= 1; or b++ can be just a regular RMW, not an atomic RMW, so it doesn't have to use lock xor or lock btc. (x86 atomic RMW is only possible with sequential consistency vs. runtime reordering, i.e. the lock prefix is also a full memory barrier.)
Code that modifies a non-atomic boolean can optimize away the actual modifications, e.g.
bool regular_bool;   // plain non-atomic global

void loop() {
    for(int i=0 ; i&lt;10000 ; i++) {
        regular_bool ^= 1;
    }
}
compiles to asm that keeps regular_bool in a register. Unfortunately it doesn't optimize away to nothing (which it could, because flipping a boolean an even number of times sets it back to its original value), but a smarter compiler could do that.
loop():
movzx edx, BYTE PTR regular_bool[rip] # load into a register
mov eax, 10000
.L17: # do {
xor edx, 1 # flip the boolean
sub eax, 1
jne .L17 # } while(--i);
mov BYTE PTR regular_bool[rip], dl # store back the result
ret
Even if written as atomic_b.store(!atomic_b.load(mo_relaxed), mo_relaxed) (separate atomic load and store), you'd still get a store/reload in the loop, creating a 6-cycle loop-carried dependency chain through the store/reload (on Intel CPUs with 5-cycle store-forwarding latency) instead of a 1-cycle dep chain through a register.
Answer (score 1) – answered Nov 11 at 18:36 by Paul Sanders, edited Nov 11 at 18:39
Checking over at Godbolt, loading a regular bool and a std::atomic&lt;bool&gt; generate different code, although not because of synchronisation issues. Instead, the compiler (gcc) seems unwilling to assume that a std::atomic&lt;bool&gt; is guaranteed to be either 0 or 1. Strange, that.

Clang does the same thing, although the code generated is slightly different in detail.
Using cout &lt;&lt; clutters the code a lot. godbolt.org/z/hFEQ5f is easier to read with separate functions that return the value of the global, like bool load_regular() { return regular_bool; } that compiles to a single movzx. (And the atomic version still booleanizes for no apparent reason.)
– Peter Cordes
Nov 11 at 18:39
@Peter I did it that way to stop the compiler optimising out the loads. Although I see from your example that moving the load into a separate function generates better code.
– Paul Sanders
Nov 11 at 18:40
Yeah I know, and my point is that returning a value from a function instead of writing a main solves the same problem much more cleanly. See How to remove "noise" from GCC/clang assembly output?. Remember you're just writing code so you can look at the asm, not run it.
– Peter Cordes
Nov 11 at 18:42
@Peter Ah, I see you never bother to call the functions so that gcc cannot inline them or optimise them away. A useful trick that.
– Paul Sanders
Nov 11 at 18:44
Even if you did write callers, you can still look at the stand-alone definition as well, if you don't make them static or inline.
– Peter Cordes
Nov 11 at 18:47