First clEnqueueMapBuffer call takes much time












I have a performance issue in my OpenCL adaptation of YOLO.



The function that just pulls data from the device is slow on the first call and fast on the next several calls. Here is a log of the calls, with times in microseconds:



    clEnqueueMapBuffer       144469
    memcpy                   2
    clEnqueueUnmapMemObject  31
    clEnqueueMapBuffer       466
    memcpy                   103
    clEnqueueUnmapMemObject  14
    clEnqueueMapBuffer       468
    memcpy                   106
    clEnqueueUnmapMemObject  17


The first call copies only 1 byte (which is why its memcpy takes just 2 microseconds).



The memory is allocated by this code:



    if (!x)
        x = (float*) calloc(n, sizeof(float));

    buf.ptr = x;

    cl_int clErr;
    buf.org = clCreateBuffer(opencl_context,
                             CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
                             buf.len * buf.obs, buf.ptr, &clErr);


The code that pulls the data is as follows:



    #ifdef BENCHMARK
        clock_t t;
        double time_taken;
        t = clock();
    #endif
        cl_int clErr;
        void* map = clEnqueueMapBuffer(opencl_queues[opencl_device_id_t], x_gpu.mem, CL_TRUE, CL_MAP_READ,
                                       0, x_gpu.len * x_gpu.obs, 0, NULL, NULL, &clErr);
    #ifdef BENCHMARK
        t = clock() - t;
        time_taken = ((double)t);
        printf("clEnqueueMapBuffer\t%d\n", (int)time_taken);
        t = clock();
    #endif
        if (clErr != CL_SUCCESS)
            printf("could not map array to device. error: %s\n", clCheckError(clErr));
        memcpy(x, map, (n - x_gpu.off) * x_gpu.obs);
    #ifdef BENCHMARK
        t = clock() - t;
        time_taken = ((double)t);
        printf("memcpy\t%d\n", (int)time_taken);
        t = clock();
    #endif
        clErr = clEnqueueUnmapMemObject(opencl_queues[opencl_device_id_t], x_gpu.mem, map, 0, NULL, NULL);
        if (clErr != CL_SUCCESS)
            printf("could not unmap array from device. error: %s\n", clCheckError(clErr));
    #ifdef BENCHMARK
        t = clock() - t;
        time_taken = ((double)t);
        printf("clEnqueueUnmapMemObject\t%d\n", (int)time_taken);
    #endif


What could cause such a delay on the first call, and how can I reduce it?
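A side note on the measurement itself, offered as a sketch rather than as part of the original question: the benchmark prints raw clock() ticks. On POSIX systems clock() reports processor time used by the program, not elapsed wall-clock time, and the ticks equal microseconds only when CLOCKS_PER_SEC is 1000000. If the host thread sleeps while it waits for the device, that wait may not show up in clock() at all. The helpers below (wall_clock_us, timespec_diff_us and clock_ticks_to_us are hypothetical names, not from the question) assume a POSIX system with clock_gettime() and CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L  /* exposes clock_gettime() under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Difference (end - start) in whole microseconds. */
long long timespec_diff_us(const struct timespec *start,
                           const struct timespec *end)
{
    assert(start != NULL && end != NULL);
    /* Work in nanoseconds first so a negative tv_nsec difference is handled. */
    long long ns = (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
                 + (long long)(end->tv_nsec - start->tv_nsec);
    return ns / 1000;
}

/* Current reading of the monotonic wall clock in microseconds, or -1 on error. */
long long wall_clock_us(void)
{
    struct timespec zero = {0};
    struct timespec now;
    if (clock_gettime(CLOCK_MONOTONIC, &now) != 0)
        return -1;
    return timespec_diff_us(&zero, &now);
}

/* Convert a clock() tick delta to microseconds of processor time. */
long long clock_ticks_to_us(clock_t ticks)
{
    return (long long)((double)ticks * 1000000.0 / (double)CLOCKS_PER_SEC);
}
```

In the benchmark, one reading of wall_clock_us() before a call and one after it would replace each pair of clock() calls, and the difference of the two readings is the elapsed time in microseconds.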










  • Are you certain any previous operations on the command queue, such as enqueued kernels, have finished by the time you call clEnqueueMapBuffer()? If unsure, try inserting clFinish() before the mapping. (Generally, the clFinish() is not needed and not having it there is potentially faster, but asynchronously queued commands will cause misleading time measurements without it.)
    – pmdj
    Nov 12 '18 at 18:35










  • The other possibility is simply that the implementation is copying the data from GPU VRAM to host memory, and this will take time. In this case, you may wish to experiment with removing the CL_MEM_ALLOC_HOST_PTR flag on buffer creation, and/or to use clEnqueueReadBuffer() to get the GPU's DMA controller to perform the copy. (If you're just using memcpy to copy all the data out, there's not really much point in using the mapping API - the idea of that is to allow direct, zero-copy access.)
    – pmdj
    Nov 12 '18 at 18:39












  • You are right. There were unfinished operations. Calling clFinish() accounts for all of the time: clFinish 148326, clEnqueueMapBuffer 155
    – StasK
    Nov 12 '18 at 18:42












  • Cool, I've posted an answer explaining this in a little more detail. Hope that all makes sense!
    – pmdj
    Nov 12 '18 at 20:42
















c performance memory-management gpu opencl






edited Nov 12 '18 at 18:16 by Alon
asked Nov 12 '18 at 18:11 by StasK












1 Answer
Your clEnqueueMapBuffer() call is blocking (CL_TRUE for the blocking_map parameter), which means that the call only returns once the mapping operation has completed. If your command queue is in-order (the default, i.e. not created with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE), any previously queued asynchronous commands, such as enqueued kernels, must complete before the mapping can even begin. If there are such earlier commands, you are actually measuring their completion as well as the memory-mapping operation. To avoid this, add a clFinish() call before starting your clock. (It is possibly slightly more efficient not to call clFinish(), so I recommend you only leave it in for measurement purposes.)
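The effect described above can be illustrated with a toy model in plain C. This is only a sketch: it makes no OpenCL calls, the names toy_queue, toy_enqueue_async, toy_blocking_command and toy_finish are hypothetical, and the model assumes an in-order queue where a blocking command waits for everything queued before it. The figures used in the usage note are the ones reported in the asker's follow-up comment (clFinish 148326, clEnqueueMapBuffer 155).

```c
#include <assert.h>
#include <stddef.h>

/* Toy in-order queue: it only tracks how much queued work is unfinished. */
typedef struct {
    long long pending_us;  /* device time still owed by earlier commands */
} toy_queue;

/* Asynchronous enqueue: returns at once, the work piles up in the queue. */
void toy_enqueue_async(toy_queue *q, long long cost_us)
{
    assert(q != NULL && cost_us >= 0);
    q->pending_us += cost_us;
}

/* Blocking command: the host waits for all earlier work and then for the
 * command itself. Returns the time the host observes for this one call. */
long long toy_blocking_command(toy_queue *q, long long cost_us)
{
    assert(q != NULL && cost_us >= 0);
    long long observed_us = q->pending_us + cost_us;
    q->pending_us = 0;  /* everything queued so far has now completed */
    return observed_us;
}

/* Analogue of clFinish(): wait until the queue is empty. */
long long toy_finish(toy_queue *q)
{
    return toy_blocking_command(q, 0);
}
```

With 148326 microseconds of kernels pending, a blocking map that costs 155 is observed as 148481 in this model, while calling toy_finish() first reports 148326 for the drain and 155 for the map. In the question's benchmark, the equivalent change would be a clFinish(opencl_queues[opencl_device_id_t]) call placed before the first t = clock(); line.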






answered Nov 12 '18 at 20:41 by pmdj





























