Implicit barrier vs nowait in case of two successive pragma omp for












6

Looking at the document here, the following construct is well defined:



#pragma omp parallel              // Line 1
{
    #pragma omp for nowait        // Line 3
    for (i=0; i<N; i++)
        a[i] = // some expression
    #pragma omp for               // Line 6
    for (i=0; i<N; i++)
        b[i] = ...... a[i] ......
}


since




Here the nowait clause implies that threads can start on the second loop while other threads are still working on the first. Since the two loops use the same schedule here, an iteration that uses a[i] can indeed rely on it that that value has been computed.




I am having a tough time understanding why this would be. Suppose Line 3 were:



#pragma omp for



then, because of the implicit barrier just before Line 6, the second for loop would see the values at all indices of a fully computed. But with the nowait in Line 3, how does this work?



Suppose Line 1 spawns 4 threads, t1, t2, t3 and t4, N is 8, and the partition of indices in the first for loop is:



t1: 0, 4
t2: 1, 5
t3: 2, 6
t4: 3, 7


Suppose t1 completes indices 0 and 4 first and arrives at Line 6. What exactly happens now? How is it guaranteed that t1 gets to operate on the same indices 0 and 4, whose a values it correctly computed in the first loop? And what if the second for loop accesses a[i+1]?










c++ openmp

asked Nov 13 '18 at 8:46
Tryer
1,093
























2 Answers


















7

The material you quote is wrong. It becomes correct if you add schedule(static) to both loops; this guarantees the same distribution of indices among threads for successive loops. The default schedule is implementation defined, so you cannot assume it to be static. To quote the standard:

    Different loop regions with the same schedule and iteration count,
    even if they occur in the same parallel region, can distribute
    iterations among threads differently. The only exception is for the
    static schedule as specified in Table 2.5. Programs that depend on
    which thread executes a particular iteration under any other
    circumstances are non-conforming.

If the second for loop accesses a[i+1], you must absolutely leave the barrier there.
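For illustration only, here is a sketch of my own of the fix this answer describes (the function name, N, and the concrete expressions are placeholders, not code from the quoted document):

    #include <omp.h>

    #define N 1000
    double a[N], b[N];

    void two_loops(void)
    {
        #pragma omp parallel
        {
            /* Explicit schedule(static): both loops have the same iteration
               count and the same static schedule, so each thread is assigned
               the same chunks of i in both loops. */
            #pragma omp for schedule(static) nowait
            for (int i = 0; i < N; i++)
                a[i] = 0.5 * i;              /* stands in for "some expression" */

            /* A thread may start this loop before the others finish the first
               one, but it only reads a[i] values that it wrote itself above. */
            #pragma omp for schedule(static)
            for (int i = 0; i < N; i++)
                b[i] = 2.0 * a[i];
        }
    }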






answered Nov 13 '18 at 8:57
Zulan
15.3k
2

To me, the statement that there is no potential problem in the example is wrong.

Indeed, the scheduling will be the same for both loops, since it is not explicitly defined: it will be the default one. Furthermore, if the scheduling were of static type, then indeed there wouldn't be any issue, since the thread that handles any given entry of array a in the second loop would be the same one that wrote it in the first loop.

But the actual problem here is that the default scheduling is not defined by the OpenMP standard; it is implementation defined. For the (many) implementations where the default scheduling is static, there cannot be any race condition in this snippet. But if the default scheduling is dynamic, then, as you notice, a race condition can happen and the result is undefined.
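To make the failure mode concrete, here is a sketch of my own (not part of this answer; function name, N, and expressions are placeholders) that forces a dynamic schedule, which is how an unlucky implementation-defined default could behave:

    #include <omp.h>

    #define N 8
    double a[N], b[N];

    /* With dynamic scheduling, chunks are handed out on demand, so the thread
       that executes iteration i in the second loop need not be the one that
       wrote a[i] in the first loop. Combined with nowait, b[i] may then read
       a[i] before it has been written: a data race. */
    void racy_variant(void)
    {
        #pragma omp parallel
        {
            #pragma omp for schedule(dynamic, 1) nowait
            for (int i = 0; i < N; i++)
                a[i] = 0.5 * i;

            #pragma omp for schedule(dynamic, 1)
            for (int i = 0; i < N; i++)
                b[i] = 2.0 * a[i];
        }
    }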






answered Nov 13 '18 at 8:59
Gilles
6,592
• What if the second loop was like for(int i = 0; i < N/2; i++)? How would the allocation be maintained consistently?
  – Tryer, Nov 13 '18 at 9:04

• It would not - both loops have to use the same iteration count and the same static schedule.
  – Zulan, Nov 13 '18 at 9:11

• Exactly as @Zulan said.
  – Gilles, Nov 13 '18 at 9:18

• What about the following type of second loop: for(int i = 1; i < N+1; i++) b[i-1] = a[i-1] vs for(int i = 1; i < N+1; i++) b[i] = a[i]? Both seem to satisfy the standard if I understand it correctly (the iteration count is N), so how would the iterations be partitioned?
  – Tryer, Nov 13 '18 at 9:18

• The fact that there is no race condition in your initial example with static scheduling isn't just due to identical looping; it is also (as I mentioned) because it is the thread which wrote a piece of data in a that subsequently reads it, and it alone. Your second example doesn't follow this rule, and isn't safe.
  – Gilles, Nov 13 '18 at 9:22
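The last two comments describe why the static-schedule guarantee only holds when each thread reads exactly the entries it wrote. Below is a small sketch of my own (not from the comment thread; the function name, N, and expressions are placeholders) of a variant that breaks that rule even though both loops use schedule(static):

    #include <omp.h>

    #define N 8
    double a[N], b[N];

    void shifted_read(void)
    {
        #pragma omp parallel
        {
            #pragma omp for schedule(static) nowait
            for (int i = 0; i < N; i++)
                a[i] = 2.0 * i;            /* each thread writes its own chunk of a */

            /* Different iteration count and a shifted read: a[i+1] can lie in
               a chunk owned by another thread, so without the barrier this is
               a race even with matching static schedules. */
            #pragma omp for schedule(static)
            for (int i = 0; i < N - 1; i++)
                b[i] = a[i + 1];
        }
    }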










