First clEnqueueMapBuffer call takes a long time
I have a performance issue in a YOLO port to OpenCL.
The method that just pulls data from the device is slow on the first call and fast on the next several calls. Here is a log of the calls, with times in microseconds:
clEnqueueMapBuffer 144469
memcpy 2
clEnqueueUnmapMemObject 31
clEnqueueMapBuffer 466
memcpy 103
clEnqueueUnmapMemObject 14
clEnqueueMapBuffer 468
memcpy 106
clEnqueueUnmapMemObject 17
The first call copies only 1 byte, which is why its memcpy takes just 2 microseconds.
The memory is allocated by this code:
if (!x)
    x = (float*) calloc(n, sizeof(float));
buf.ptr = x;
cl_int clErr;
buf.org = clCreateBuffer(opencl_context,
                         CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR,
                         buf.len * buf.obs, buf.ptr, &clErr);
The code that pulls the data back is:
#ifdef BENCHMARK
clock_t t;
double time_taken;
t = clock();
#endif
cl_int clErr;
void* map = clEnqueueMapBuffer(opencl_queues[opencl_device_id_t], x_gpu.mem, CL_TRUE, CL_MAP_READ,
                               0, x_gpu.len * x_gpu.obs, 0, NULL, NULL, &clErr);
#ifdef BENCHMARK
t = clock() - t;
time_taken = ((double)t);
printf("clEnqueueMapBuffer\t%d\n", (int)time_taken);
t = clock();
#endif
if (clErr != CL_SUCCESS)
    printf("could not map array to device. error: %s\n", clCheckError(clErr));
memcpy(x, map, (n - x_gpu.off) * x_gpu.obs);
#ifdef BENCHMARK
t = clock() - t;
time_taken = ((double)t);
printf("memcpy\t%d\n", (int)time_taken);
t = clock();
#endif
clErr = clEnqueueUnmapMemObject(opencl_queues[opencl_device_id_t], x_gpu.mem, map, 0, NULL, NULL);
if (clErr != CL_SUCCESS)
    printf("could not unmap array from device. error: %s\n", clCheckError(clErr));
#ifdef BENCHMARK
t = clock() - t;
time_taken = ((double)t);
printf("clEnqueueUnmapMemObject\t%d\n", (int)time_taken);
#endif
What could cause such a delay on the first call, and how can I reduce it?
c performance memory-management gpu opencl
Are you certain any previous operations on the command queue, such as enqueued kernels, have finished by the time you call clEnqueueMapBuffer()? If unsure, try inserting clFinish() before the mapping. (Generally, the clFinish() is not needed and not having it there is potentially faster, but asynchronously queued commands will cause misleading time measurements without it.) – pmdj, Nov 12 '18 at 18:35
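For reference, a minimal sketch of that suggestion applied to the pull routine from the question (opencl_queues, opencl_device_id_t, and x_gpu are the question's own names; illustrative only, not tested against the YOLO code):

cl_int clErr;

/* Drain any previously enqueued asynchronous work (e.g. kernels) so the
   timer below measures only the map operation itself. */
clFinish(opencl_queues[opencl_device_id_t]);

clock_t t = clock();
void* map = clEnqueueMapBuffer(opencl_queues[opencl_device_id_t], x_gpu.mem,
                               CL_TRUE, CL_MAP_READ,
                               0, x_gpu.len * x_gpu.obs, 0, NULL, NULL, &clErr);
t = clock() - t;  /* now reflects only the map, not earlier queued commands */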
The other possibility is simply that the implementation is copying the data from GPU VRAM to host memory, and this will take time. In this case, you may wish to experiment with removing the CL_MEM_ALLOC_HOST_PTR flag on buffer creation, and/or to use clEnqueueReadBuffer() to get the GPU's DMA controller to perform the copy. (If you're just using memcpy to copy all the data out, there's not really much point in using the mapping API - the idea of that is to allow direct, zero-copy access.) – pmdj, Nov 12 '18 at 18:39
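For illustration, a minimal sketch of the clEnqueueReadBuffer() variant, which replaces the map/memcpy/unmap sequence with a single blocking read into the existing host array x (again reusing the question's names, including its clCheckError helper; a sketch, not a drop-in patch):

cl_int clErr;
size_t bytes = (n - x_gpu.off) * x_gpu.obs;

/* Blocking read: returns once `bytes` bytes have been copied from the
   device buffer into the host array x. */
clErr = clEnqueueReadBuffer(opencl_queues[opencl_device_id_t], x_gpu.mem,
                            CL_TRUE,   /* blocking_read */
                            0,         /* offset into the buffer */
                            bytes, x,
                            0, NULL, NULL);
if (clErr != CL_SUCCESS)
    printf("could not read array from device. error: %s\n", clCheckError(clErr));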
You are right. There were unfinished operations. Calling clFinish() takes all the time: clFinish 148326, clEnqueueMapBuffer 155. – StasK, Nov 12 '18 at 18:42
Cool, I've posted an answer explaining this in a little more detail. Hope that all makes sense! – pmdj, Nov 12 '18 at 20:42
1 Answer
Your clEnqueueMapBuffer() call is blocking (CL_TRUE for the blocking_map parameter), which means that the call will only return once the mapping operation has completed. If your command queue is not concurrent, any previously queued asynchronous commands, such as enqueued kernels, will need to complete before the mapping can even begin. If there are such earlier commands, you are actually measuring their completion as well as the memory-mapping operation. To avoid this, add a clFinish() call before starting your clock. (It is possibly slightly more efficient not to call clFinish(), so I recommend you only leave it in for measurement purposes.)
answered Nov 12 '18 at 20:41 – pmdj
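As an aside (not part of the answer above), OpenCL event profiling can time the map operation on the device itself rather than with clock(); the following is a minimal sketch, assuming the queue was created with CL_QUEUE_PROFILING_ENABLE and reusing the question's names:

/* Assumes the queue was created with profiling enabled, e.g.
   clCreateCommandQueue(opencl_context, device, CL_QUEUE_PROFILING_ENABLE, &clErr); */
cl_event evt;
cl_int clErr;
void* map = clEnqueueMapBuffer(opencl_queues[opencl_device_id_t], x_gpu.mem,
                               CL_TRUE, CL_MAP_READ,
                               0, x_gpu.len * x_gpu.obs, 0, NULL, &evt, &clErr);

cl_ulong start = 0, end = 0;  /* device timestamps in nanoseconds */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
printf("map took %lu us on the device\n", (unsigned long)((end - start) / 1000));
clReleaseEvent(evt);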