Implicit barrier vs nowait in case of two successive pragma omp for












6















Looking at the document here, the following construct is well defined:



#pragma omp parallel          //Line 1
{
#pragma omp for nowait //Line 3
for (i=0; i<N; i++)
a[i] = // some expression
#pragma omp for //Line 6
for (i=0; i<N; i++)
b[i] = ...... a[i] ......
}


since




Here the nowait clause implies that threads can start on the second loop while other threads are still working on the first. Since the two loops use the same schedule here, an iteration that uses a[i] can indeed rely on it that that value has been computed.




I am having a tough time understanding why this would be. Suppose Line 3 were:



#pragma omp for



then, since there is an implicit barrier just before Line 6, the next for loop will have values at all indices of a fully computed. But, with the no wait in Line 3, how would it work?



Suppose, Line 1 triggers 4 threads, t1, t2, t3 and t4. Suppose N is 8 and the partition of indices in the first for loop is thus:



t1: 0, 4
t2: 1, 5
t3: 2, 6
t4: 3, 7


Suppose t1 completes indices 0 and 4 first and lands up at Line 6 What exactly happens now? How is it guaranteed that it now gets to operate on the same indices 0 and 4, for which the a values are correctly computed by it in the previous iteration? What if the second for loop accesses a[i+1]?










share|improve this question



























    6















    Looking at the document here, the following construct is well defined:



    #pragma omp parallel          //Line 1
    {
    #pragma omp for nowait //Line 3
    for (i=0; i<N; i++)
    a[i] = // some expression
    #pragma omp for //Line 6
    for (i=0; i<N; i++)
    b[i] = ...... a[i] ......
    }


    since




    Here the nowait clause implies that threads can start on the second loop while other threads are still working on the first. Since the two loops use the same schedule here, an iteration that uses a[i] can indeed rely on it that that value has been computed.




    I am having a tough time understanding why this would be. Suppose Line 3 were:



    #pragma omp for



    then, since there is an implicit barrier just before Line 6, the next for loop will have values at all indices of a fully computed. But, with the no wait in Line 3, how would it work?



    Suppose, Line 1 triggers 4 threads, t1, t2, t3 and t4. Suppose N is 8 and the partition of indices in the first for loop is thus:



    t1: 0, 4
    t2: 1, 5
    t3: 2, 6
    t4: 3, 7


    Suppose t1 completes indices 0 and 4 first and lands up at Line 6 What exactly happens now? How is it guaranteed that it now gets to operate on the same indices 0 and 4, for which the a values are correctly computed by it in the previous iteration? What if the second for loop accesses a[i+1]?










    share|improve this question

























      6












      6








      6








      Looking at the document here, the following construct is well defined:



      #pragma omp parallel          //Line 1
      {
      #pragma omp for nowait //Line 3
      for (i=0; i<N; i++)
      a[i] = // some expression
      #pragma omp for //Line 6
      for (i=0; i<N; i++)
      b[i] = ...... a[i] ......
      }


      since




      Here the nowait clause implies that threads can start on the second loop while other threads are still working on the first. Since the two loops use the same schedule here, an iteration that uses a[i] can indeed rely on it that that value has been computed.




      I am having a tough time understanding why this would be. Suppose Line 3 were:



      #pragma omp for



      then, since there is an implicit barrier just before Line 6, the next for loop will have values at all indices of a fully computed. But, with the no wait in Line 3, how would it work?



      Suppose, Line 1 triggers 4 threads, t1, t2, t3 and t4. Suppose N is 8 and the partition of indices in the first for loop is thus:



      t1: 0, 4
      t2: 1, 5
      t3: 2, 6
      t4: 3, 7


      Suppose t1 completes indices 0 and 4 first and lands up at Line 6 What exactly happens now? How is it guaranteed that it now gets to operate on the same indices 0 and 4, for which the a values are correctly computed by it in the previous iteration? What if the second for loop accesses a[i+1]?










      share|improve this question














      Looking at the document here, the following construct is well defined:



      #pragma omp parallel          //Line 1
      {
      #pragma omp for nowait //Line 3
      for (i=0; i<N; i++)
      a[i] = // some expression
      #pragma omp for //Line 6
      for (i=0; i<N; i++)
      b[i] = ...... a[i] ......
      }


      since




      Here the nowait clause implies that threads can start on the second loop while other threads are still working on the first. Since the two loops use the same schedule here, an iteration that uses a[i] can indeed rely on it that that value has been computed.




      I am having a tough time understanding why this would be. Suppose Line 3 were:



      #pragma omp for



      then, since there is an implicit barrier just before Line 6, the next for loop will have values at all indices of a fully computed. But, with the no wait in Line 3, how would it work?



      Suppose, Line 1 triggers 4 threads, t1, t2, t3 and t4. Suppose N is 8 and the partition of indices in the first for loop is thus:



      t1: 0, 4
      t2: 1, 5
      t3: 2, 6
      t4: 3, 7


      Suppose t1 completes indices 0 and 4 first and lands up at Line 6 What exactly happens now? How is it guaranteed that it now gets to operate on the same indices 0 and 4, for which the a values are correctly computed by it in the previous iteration? What if the second for loop accesses a[i+1]?







      c++ openmp






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 13 '18 at 8:46









      TryerTryer

      1,09311120




      1,09311120
























          2 Answers
          2






          active

          oldest

          votes


















          7














          The material you quote is wrong. It becomes correct if you add schedule(static) to both loops - this guarantees the same distribution of indices among threads for successive loops. The default schedule is implementation defined, you cannot assume it to be static. To quote the standard:




          Different loop regions with the same schedule and iteration count,
          even if they occur in the same parallel region, can distribute
          iterations among threads differently. The only exception is for the
          static schedule as specified in Table 2.5. Programs that depend on
          which thread executes a particular iteration under any other
          circumstances are non-conforming.




          If the second for loop accesses a[i+1] you must absolutely leave the barrier there.






          share|improve this answer































            2














            To me the statement that there is no potential problem in the example is wrong.



            Indeed, scheduling will be the same as it is not explicitly defined. It will be the default one. Furthermore, if the scheduling was of static type, then indeed, there wouldn't be any issue since the thread that would handle any given data in array a inside the second loop would be the same as the one which would have written it in the first loop.



            But the actual problem here is that the default scheduling is not defined by the OpenMP standard. This is implementation defined... For the (many) implementations where the default scheduling is static, there cannot be any race condition in the snippet. But if the default scheduling is dynamic, then, as you notice, a race condition can happen and the result is undefined.






            share|improve this answer
























            • what if the second loop was like for(int i = 0, i < N/2; i++). How would the allocation be maintained consistently?

              – Tryer
              Nov 13 '18 at 9:04






            • 3





              It would not - both loops have to use the same iteration count and same static schedule.

              – Zulan
              Nov 13 '18 at 9:11











            • Exactly as @Zulan said.

              – Gilles
              Nov 13 '18 at 9:18











            • What about the following type of second loop: for(int i = 1; i < N+1; i++) b[i-1] = a[i-1] vs for(int i = 1; i < N+1; i++) b[i] = a[i]; Both seem to satisfy the standard if I understand it correctly (iteration count is N), so, how would the iterations be partitioned ?

              – Tryer
              Nov 13 '18 at 9:18













            • The fact that there is no race condition in your initial example with static scheduling isn't just due to identical looping, it is also (as I mentioned) because this is the thread which wrote a data in a that reads it subsequently, and it alone. Your second example doesn't follow this rule, and isn't safe.

              – Gilles
              Nov 13 '18 at 9:22











            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53277030%2fimplicit-barrier-vs-nowait-in-case-of-two-successive-pragma-omp-for%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            2 Answers
            2






            active

            oldest

            votes








            2 Answers
            2






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            7














            The material you quote is wrong. It becomes correct if you add schedule(static) to both loops - this guarantees the same distribution of indices among threads for successive loops. The default schedule is implementation defined, you cannot assume it to be static. To quote the standard:




            Different loop regions with the same schedule and iteration count,
            even if they occur in the same parallel region, can distribute
            iterations among threads differently. The only exception is for the
            static schedule as specified in Table 2.5. Programs that depend on
            which thread executes a particular iteration under any other
            circumstances are non-conforming.




            If the second for loop accesses a[i+1] you must absolutely leave the barrier there.






            share|improve this answer




























              7














              The material you quote is wrong. It becomes correct if you add schedule(static) to both loops - this guarantees the same distribution of indices among threads for successive loops. The default schedule is implementation defined, you cannot assume it to be static. To quote the standard:




              Different loop regions with the same schedule and iteration count,
              even if they occur in the same parallel region, can distribute
              iterations among threads differently. The only exception is for the
              static schedule as specified in Table 2.5. Programs that depend on
              which thread executes a particular iteration under any other
              circumstances are non-conforming.




              If the second for loop accesses a[i+1] you must absolutely leave the barrier there.






              share|improve this answer


























                7












                7








                7







                The material you quote is wrong. It becomes correct if you add schedule(static) to both loops - this guarantees the same distribution of indices among threads for successive loops. The default schedule is implementation defined, you cannot assume it to be static. To quote the standard:




                Different loop regions with the same schedule and iteration count,
                even if they occur in the same parallel region, can distribute
                iterations among threads differently. The only exception is for the
                static schedule as specified in Table 2.5. Programs that depend on
                which thread executes a particular iteration under any other
                circumstances are non-conforming.




                If the second for loop accesses a[i+1] you must absolutely leave the barrier there.






                share|improve this answer













                The material you quote is wrong. It becomes correct if you add schedule(static) to both loops - this guarantees the same distribution of indices among threads for successive loops. The default schedule is implementation defined, you cannot assume it to be static. To quote the standard:




                Different loop regions with the same schedule and iteration count,
                even if they occur in the same parallel region, can distribute
                iterations among threads differently. The only exception is for the
                static schedule as specified in Table 2.5. Programs that depend on
                which thread executes a particular iteration under any other
                circumstances are non-conforming.




                If the second for loop accesses a[i+1] you must absolutely leave the barrier there.







                share|improve this answer












                share|improve this answer



                share|improve this answer










                answered Nov 13 '18 at 8:57









                ZulanZulan

                15.3k63070




                15.3k63070

























                    2














                    To me the statement that there is no potential problem in the example is wrong.



                    Indeed, scheduling will be the same as it is not explicitly defined. It will be the default one. Furthermore, if the scheduling was of static type, then indeed, there wouldn't be any issue since the thread that would handle any given data in array a inside the second loop would be the same as the one which would have written it in the first loop.



                    But the actual problem here is that the default scheduling is not defined by the OpenMP standard. This is implementation defined... For the (many) implementations where the default scheduling is static, there cannot be any race condition in the snippet. But if the default scheduling is dynamic, then, as you notice, a race condition can happen and the result is undefined.






                    share|improve this answer
























                    • what if the second loop was like for(int i = 0, i < N/2; i++). How would the allocation be maintained consistently?

                      – Tryer
                      Nov 13 '18 at 9:04






                    • 3





                      It would not - both loops have to use the same iteration count and same static schedule.

                      – Zulan
                      Nov 13 '18 at 9:11











                    • Exactly as @Zulan said.

                      – Gilles
                      Nov 13 '18 at 9:18











                    • What about the following type of second loop: for(int i = 1; i < N+1; i++) b[i-1] = a[i-1] vs for(int i = 1; i < N+1; i++) b[i] = a[i]; Both seem to satisfy the standard if I understand it correctly (iteration count is N), so, how would the iterations be partitioned ?

                      – Tryer
                      Nov 13 '18 at 9:18













                    • The fact that there is no race condition in your initial example with static scheduling isn't just due to identical looping, it is also (as I mentioned) because this is the thread which wrote a data in a that reads it subsequently, and it alone. Your second example doesn't follow this rule, and isn't safe.

                      – Gilles
                      Nov 13 '18 at 9:22
















                    2














                    To me the statement that there is no potential problem in the example is wrong.



                    Indeed, scheduling will be the same as it is not explicitly defined. It will be the default one. Furthermore, if the scheduling was of static type, then indeed, there wouldn't be any issue since the thread that would handle any given data in array a inside the second loop would be the same as the one which would have written it in the first loop.



                    But the actual problem here is that the default scheduling is not defined by the OpenMP standard. This is implementation defined... For the (many) implementations where the default scheduling is static, there cannot be any race condition in the snippet. But if the default scheduling is dynamic, then, as you notice, a race condition can happen and the result is undefined.






                    share|improve this answer
























                    • what if the second loop was like for(int i = 0, i < N/2; i++). How would the allocation be maintained consistently?

                      – Tryer
                      Nov 13 '18 at 9:04






                    • 3





                      It would not - both loops have to use the same iteration count and same static schedule.

                      – Zulan
                      Nov 13 '18 at 9:11











                    • Exactly as @Zulan said.

                      – Gilles
                      Nov 13 '18 at 9:18











                    • What about the following type of second loop: for(int i = 1; i < N+1; i++) b[i-1] = a[i-1] vs for(int i = 1; i < N+1; i++) b[i] = a[i]; Both seem to satisfy the standard if I understand it correctly (iteration count is N), so, how would the iterations be partitioned ?

                      – Tryer
                      Nov 13 '18 at 9:18













                    • The fact that there is no race condition in your initial example with static scheduling isn't just due to identical looping, it is also (as I mentioned) because this is the thread which wrote a data in a that reads it subsequently, and it alone. Your second example doesn't follow this rule, and isn't safe.

                      – Gilles
                      Nov 13 '18 at 9:22














                    2












                    2








                    2







                    To me the statement that there is no potential problem in the example is wrong.



                    Indeed, scheduling will be the same as it is not explicitly defined. It will be the default one. Furthermore, if the scheduling was of static type, then indeed, there wouldn't be any issue since the thread that would handle any given data in array a inside the second loop would be the same as the one which would have written it in the first loop.



                    But the actual problem here is that the default scheduling is not defined by the OpenMP standard. This is implementation defined... For the (many) implementations where the default scheduling is static, there cannot be any race condition in the snippet. But if the default scheduling is dynamic, then, as you notice, a race condition can happen and the result is undefined.






                    share|improve this answer













                    To me the statement that there is no potential problem in the example is wrong.



                    Indeed, scheduling will be the same as it is not explicitly defined. It will be the default one. Furthermore, if the scheduling was of static type, then indeed, there wouldn't be any issue since the thread that would handle any given data in array a inside the second loop would be the same as the one which would have written it in the first loop.



                    But the actual problem here is that the default scheduling is not defined by the OpenMP standard. This is implementation defined... For the (many) implementations where the default scheduling is static, there cannot be any race condition in the snippet. But if the default scheduling is dynamic, then, as you notice, a race condition can happen and the result is undefined.







                    share|improve this answer












                    share|improve this answer



                    share|improve this answer










                    answered Nov 13 '18 at 8:59









                    GillesGilles

                    6,59232340




                    6,59232340













                    • what if the second loop was like for(int i = 0, i < N/2; i++). How would the allocation be maintained consistently?

                      – Tryer
                      Nov 13 '18 at 9:04






                    • 3





                      It would not - both loops have to use the same iteration count and same static schedule.

                      – Zulan
                      Nov 13 '18 at 9:11











                    • Exactly as @Zulan said.

                      – Gilles
                      Nov 13 '18 at 9:18











                    • What about the following type of second loop: for(int i = 1; i < N+1; i++) b[i-1] = a[i-1] vs for(int i = 1; i < N+1; i++) b[i] = a[i]; Both seem to satisfy the standard if I understand it correctly (iteration count is N), so, how would the iterations be partitioned ?

                      – Tryer
                      Nov 13 '18 at 9:18













                    • The fact that there is no race condition in your initial example with static scheduling isn't just due to identical looping, it is also (as I mentioned) because this is the thread which wrote a data in a that reads it subsequently, and it alone. Your second example doesn't follow this rule, and isn't safe.

                      – Gilles
                      Nov 13 '18 at 9:22



















                    • what if the second loop was like for(int i = 0, i < N/2; i++). How would the allocation be maintained consistently?

                      – Tryer
                      Nov 13 '18 at 9:04






                    • 3





                      It would not - both loops have to use the same iteration count and same static schedule.

                      – Zulan
                      Nov 13 '18 at 9:11











                    • Exactly as @Zulan said.

                      – Gilles
                      Nov 13 '18 at 9:18











                    • What about the following type of second loop: for(int i = 1; i < N+1; i++) b[i-1] = a[i-1] vs for(int i = 1; i < N+1; i++) b[i] = a[i]; Both seem to satisfy the standard if I understand it correctly (iteration count is N), so, how would the iterations be partitioned ?

                      – Tryer
                      Nov 13 '18 at 9:18













                    • The fact that there is no race condition in your initial example with static scheduling isn't just due to identical looping, it is also (as I mentioned) because this is the thread which wrote a data in a that reads it subsequently, and it alone. Your second example doesn't follow this rule, and isn't safe.

                      – Gilles
                      Nov 13 '18 at 9:22

















                    what if the second loop was like for(int i = 0, i < N/2; i++). How would the allocation be maintained consistently?

                    – Tryer
                    Nov 13 '18 at 9:04





                    what if the second loop was like for(int i = 0, i < N/2; i++). How would the allocation be maintained consistently?

                    – Tryer
                    Nov 13 '18 at 9:04




                    3




                    3





                    It would not - both loops have to use the same iteration count and same static schedule.

                    – Zulan
                    Nov 13 '18 at 9:11





                    It would not - both loops have to use the same iteration count and same static schedule.

                    – Zulan
                    Nov 13 '18 at 9:11













                    Exactly as @Zulan said.

                    – Gilles
                    Nov 13 '18 at 9:18





                    Exactly as @Zulan said.

                    – Gilles
                    Nov 13 '18 at 9:18













                    What about the following type of second loop: for(int i = 1; i < N+1; i++) b[i-1] = a[i-1] vs for(int i = 1; i < N+1; i++) b[i] = a[i]; Both seem to satisfy the standard if I understand it correctly (iteration count is N), so, how would the iterations be partitioned ?

                    – Tryer
                    Nov 13 '18 at 9:18







                    What about the following type of second loop: for(int i = 1; i < N+1; i++) b[i-1] = a[i-1] vs for(int i = 1; i < N+1; i++) b[i] = a[i]; Both seem to satisfy the standard if I understand it correctly (iteration count is N), so, how would the iterations be partitioned ?

                    – Tryer
                    Nov 13 '18 at 9:18















                    The fact that there is no race condition in your initial example with static scheduling isn't just due to identical looping, it is also (as I mentioned) because this is the thread which wrote a data in a that reads it subsequently, and it alone. Your second example doesn't follow this rule, and isn't safe.

                    – Gilles
                    Nov 13 '18 at 9:22





                    The fact that there is no race condition in your initial example with static scheduling isn't just due to identical looping, it is also (as I mentioned) because this is the thread which wrote a data in a that reads it subsequently, and it alone. Your second example doesn't follow this rule, and isn't safe.

                    – Gilles
                    Nov 13 '18 at 9:22


















                    draft saved

                    draft discarded




















































                    Thanks for contributing an answer to Stack Overflow!


                    • Please be sure to answer the question. Provide details and share your research!

                    But avoid



                    • Asking for help, clarification, or responding to other answers.

                    • Making statements based on opinion; back them up with references or personal experience.


                    To learn more, see our tips on writing great answers.




                    draft saved


                    draft discarded














                    StackExchange.ready(
                    function () {
                    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53277030%2fimplicit-barrier-vs-nowait-in-case-of-two-successive-pragma-omp-for%23new-answer', 'question_page');
                    }
                    );

                    Post as a guest















                    Required, but never shown





















































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown

































                    Required, but never shown














                    Required, but never shown












                    Required, but never shown







                    Required, but never shown







                    Popular posts from this blog

                    Florida Star v. B. J. F.

                    Danny Elfman

                    Lugert, Oklahoma