First clEnqueueMapBuffer call takes much time

I have a performance issue in my adaptation of YOLO to OpenCL.

The function, which just pulls data from the device, is slow on the first call and fast on the next several calls. Here is a log of the calls, with times in microseconds:

clEnqueueMapBuffer      144469
memcpy                  2
clEnqueueUnmapMemObject 31
clEnqueueMapBuffer      466
memcpy                  103
clEnqueueUnmapMemObject 14
clEnqueueMapBuffer      468
memcpy                  106
clEnqueueUnmapMemObject 17

The first call copies only 1 byte (that is where memcpy takes 2 microseconds).

The memory is allocated by this code:

if (!x)
    x = (float*) calloc(n, sizeof(float));

buf.ptr = x;

cl_int clErr;
buf.org = clCreateBuffer(opencl_context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
                         buf.len * buf.obs, buf.ptr, &clErr);

The code for pulling the data is as follows:

#ifdef BENCHMARK
    clock_t t;
    double time_taken;
    t = clock();
#endif
    cl_int clErr;
    void* map = clEnqueueMapBuffer(opencl_queues[opencl_device_id_t], x_gpu.mem, CL_TRUE, CL_MAP_READ,
                                   0, x_gpu.len * x_gpu.obs, 0, NULL, NULL, &clErr);
#ifdef BENCHMARK
    t = clock() - t;
    time_taken = ((double)t);
    printf("clEnqueueMapBuffer\t%d\n", (int)time_taken);
    t = clock();
#endif
    if (clErr != CL_SUCCESS)
        printf("could not map array to device. error: %s\n", clCheckError(clErr));
    memcpy(x, map, (n - x_gpu.off) * x_gpu.obs);
#ifdef BENCHMARK
    t = clock() - t;
    time_taken = ((double)t);
    printf("memcpy\t%d\n", (int)time_taken);
    t = clock();
#endif
    clErr = clEnqueueUnmapMemObject(opencl_queues[opencl_device_id_t], x_gpu.mem, map, 0, NULL, NULL);
    if (clErr != CL_SUCCESS)
        printf("could not unmap array from device. error: %s\n", clCheckError(clErr));
#ifdef BENCHMARK
    t = clock() - t;
    time_taken = ((double)t);
    printf("clEnqueueUnmapMemObject\t%d\n", (int)time_taken);
#endif

What could be the reason for such a delay during the first call? How can I decrease it?

edited Nov 12 '18 at 18:16

Alon

asked Nov 12 '18 at 18:11

StasK

Are you certain that any previous operations on the command queue, such as enqueued kernels, have finished by the time you call clEnqueueMapBuffer()? If unsure, try inserting clFinish() before the mapping. (Generally, clFinish() is not needed, and leaving it out is potentially faster, but without it asynchronously queued commands will cause misleading time measurements.)
– pmdj
Nov 12 '18 at 18:35

The other possibility is simply that the implementation is copying the data from GPU VRAM to host memory, and this will take time. In this case, you may wish to experiment with removing the CL_MEM_ALLOC_HOST_PTR flag on buffer creation, and/or to use clEnqueueReadBuffer() to get the GPU's DMA controller to perform the copy. (If you're just using memcpy to copy all the data out, there's not really much point in using the mapping API - the idea of that is to allow direct, zero-copy access.)
– pmdj
Nov 12 '18 at 18:39

You are right. There were unfinished operations. The clFinish() call accounts for nearly all of the time: clFinish 148326, clEnqueueMapBuffer 155
– StasK
Nov 12 '18 at 18:42

Cool, I've posted an answer explaining this in a little more detail. Hope that all makes sense!
– pmdj
Nov 12 '18 at 20:42

Tags: c, performance, memory-management, gpu, opencl


1 Answer

Your clEnqueueMapBuffer() call is blocking (CL_TRUE for the blocking_map parameter), which means that the call will only return once the mapping operation has completed. If your command queue is not concurrent (that is, it is an in-order queue, which is the default), any previously queued asynchronous commands, such as enqueued kernels, will need to complete before the mapping can even begin. If there are such earlier commands, you are actually measuring their completion as well as the memory-mapping operation. To avoid this, add a clFinish() call before starting your clock. (It is possibly slightly more efficient not to call clFinish(), so I recommend you only leave it in for measurement purposes.)

answered Nov 12 '18 at 20:41

pmdj
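
As a rough sketch of the measurement advice in the answer above: drain the queue first, then start the timer, so that earlier commands are not charged to the operation being timed. Everything in this block is a hypothetical stand-in and not part of OpenCL or of the asker's code: `struct fake_queue`, `fake_finish` (playing the role of clFinish()) and `fake_blocking_map` (playing the role of a blocking clEnqueueMapBuffer() on an in-order queue) only model the behaviour with flags. The timer uses C11 `timespec_get`, which reports elapsed time, whereas `clock()` is specified to report processor time.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical stand-in for a command queue; this is not an OpenCL type. */
struct fake_queue {
    int pending;        /* earlier commands that have not completed yet */
    int op_had_to_wait; /* set when the timed operation found pending work */
};

typedef void (*step_fn)(struct fake_queue *q);

/* Elapsed (wall-clock) time in microseconds, via C11 timespec_get. */
long long now_us(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* Stand-in for clFinish(queue): everything queued so far completes. */
void fake_finish(struct fake_queue *q)
{
    q->pending = 0;
}

/* Stand-in for a blocking map on an in-order queue:
 * it cannot start until the earlier commands are done. */
void fake_blocking_map(struct fake_queue *q)
{
    if (q->pending > 0) {
        q->op_had_to_wait = 1; /* the wait would be charged to this call */
        q->pending = 0;
    }
}

/* Run the optional drain step BEFORE starting the timer,
 * then return how long op alone took, in microseconds. */
long long time_after_drain_us(step_fn drain, step_fn op, struct fake_queue *q)
{
    assert(op != NULL && q != NULL);
    if (drain != NULL)
        drain(q);
    long long start = now_us();
    op(q);
    return now_us() - start;
}
```

In real code the drain step would be a call to clFinish() on the same queue and the timed step would be the clEnqueueMapBuffer() call; passing `NULL` as the drain step reproduces the original situation, where the wait for earlier commands ends up inside the measured interval.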
