What is the ideal bulk size formula in ElasticSearch?
I believe there should be a formula to calculate the bulk indexing size in Elasticsearch. The following are probably the variables of such a formula:
- Number of nodes
- Number of shards/index
- Document size
- RAM
- Disk write speed
- LAN speed
I wonder if anyone knows of or uses such a mathematical formula. If not, how do people decide on their bulk size? By trial and error?
Tags: elasticsearch, elasticsearch-bulk-api
asked Aug 28 '13 at 13:03 by shyos, edited May 15 '16 at 18:05 by Laurel
5 Answers
There is no golden rule for this. Quoting the documentation:
There is no “correct” number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.
answered Aug 28 '13 at 13:57 by moliware
Ultimately, one does need to tune. But is there some idea of what order of magnitude? Are we talking 10s / 100s / 1000s? Any starter suggestions to go by? – Dilum Ranatunga, Oct 15 '13 at 15:45
I usually use a bulk size between 1K and 5K docs. – moliware, Oct 16 '13 at 8:14
I derived this information from the Java API's BulkProcessor class. It defaults to 1,000 actions or 5 MB; it also lets you set a flush interval, but that is not set by default. I'm just using the default settings.
I'd suggest using BulkProcessor if you are using the Java API.
answered Nov 25 '13 at 15:05 by hudsonb
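As a sketch of what those defaults look like in code, assuming roughly the 5.x/6.x transport-client API (package names and builder signatures differ across Elasticsearch versions; the index name myindex and the already-built client are placeholders):

```java
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Client;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.xcontent.XContentType;

public class BulkProcessorSketch {

    // 'client' is an already-connected client; obtaining it is version specific and omitted here
    static BulkProcessor build(Client client) {
        return BulkProcessor.builder(client, new BulkProcessor.Listener() {
                    @Override
                    public void beforeBulk(long executionId, BulkRequest request) {
                        // called right before each bulk request is sent
                    }

                    @Override
                    public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                        // called after a successful round trip; check response.hasFailures() for per-item errors
                    }

                    @Override
                    public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                        // called when the whole bulk request failed (e.g. a connection error)
                    }
                })
                .setBulkActions(1000)                               // flush after 1,000 actions (the default)
                .setBulkSize(new ByteSizeValue(5, ByteSizeUnit.MB)) // ...or after ~5 MB of payload (the default)
                // .setFlushInterval(...)                           // optional; no time-based flush by default
                .build();
    }

    static void indexDoc(BulkProcessor bulkProcessor, String jsonSource) {
        // documents are buffered and flushed automatically once either threshold above is reached
        bulkProcessor.add(new IndexRequest("myindex", "doc").source(jsonSource, XContentType.JSON));
    }
}
```

The point of the sketch is only that the thresholds (action count, payload size, flush interval) are the knobs you end up tuning; the defaults shown are the conservative starting point this answer refers to.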
That sounds a bit conservative; I've run indexing jobs via the HTTP API with batch sizes of 10k documents (files between ~25 MB and ~80 MB) on a modest vServer. – jmng, Nov 12 '18 at 22:23
It's very conservative. However, you can't determine the ideal settings without testing with actual data on the actual cluster. These days (5 years later) we have a much larger and more powerful cluster using MUCH larger batch sizes, in MBs, with no document limit. – hudsonb, Nov 13 '18 at 1:03
I was searching for this and found your question :)
I found this in the Elastic documentation, so I will investigate the size of my documents:
It is often useful to keep an eye on the physical size of your bulk requests. One thousand 1 KB documents is very different from one thousand 1 MB documents. A good bulk size to start playing with is around 5-15 MB.
answered Mar 28 '16 at 9:55 by HADEEL
That sounds a bit conservative (probably the intention); I run indexing jobs with batch sizes of 10k documents (files between ~25 MB and ~80 MB) on a modest vServer (more below). – jmng, Nov 12 '18 at 22:21
Read the ES bulk API docs carefully: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#_using_and_sizing_bulk_requests
- Try with 1 KiB, then with 20 KiB, then with 10 KiB, ... narrowing it down by dichotomy (binary search)
- Use a bulk size in KiB (or equivalent), not a document count (see the sketch after this list)
- Send data in bulk (no streaming), and pass redundant info in the API URL if you can
- Remove superfluous whitespace from your data if possible
- Disable search index updates during the load and re-enable them afterwards
- Round-robin requests across all your data nodes
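To make the "size in bytes, not document count" point concrete, here is a minimal size-based batching sketch against the plain _bulk endpoint. It assumes a recent Elasticsearch where the mapping type can be omitted; the node URL http://localhost:9200, the index name myindex, and the 10 MiB threshold are placeholders to be tuned, not recommendations:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class SizeBasedBulkIndexer {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    // Index name goes in the URL, so each per-action metadata line can stay short.
    private static final String BULK_URL = "http://localhost:9200/myindex/_bulk";
    private static final long FLUSH_BYTES = 10L * 1024 * 1024;   // flush threshold in bytes, found by experimentation

    public static void indexAll(List<String> jsonDocs) throws Exception {
        StringBuilder batch = new StringBuilder();
        long batchBytes = 0;
        for (String doc : jsonDocs) {                             // each doc is single-line JSON, no extra whitespace
            String actionAndSource = "{\"index\":{}}\n" + doc + "\n";
            batch.append(actionAndSource);
            batchBytes += actionAndSource.getBytes(StandardCharsets.UTF_8).length;
            if (batchBytes >= FLUSH_BYTES) {                      // flush by payload size, not by document count
                flush(batch.toString());
                batch.setLength(0);
                batchBytes = 0;
            }
        }
        if (batchBytes > 0) {
            flush(batch.toString());                              // send the final partial batch
        }
    }

    private static void flush(String ndjsonBody) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(BULK_URL))
                .header("Content-Type", "application/x-ndjson")
                .POST(HttpRequest.BodyPublishers.ofString(ndjsonBody))
                .build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() >= 300) {
            throw new IllegalStateException("Bulk request failed: " + response.body());
        }
        // NOTE: a 200 response can still contain per-item failures; inspect the "errors" field in response.body().
    }
}
```

Putting the index name in the URL is what lets each action line shrink to {"index":{}}, which is the "redundant info in the API URL" tip above.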
answered Nov 8 '16 at 10:34 by Christophe Roussy
I haven't found a better way than trial and error (i.e. the traditional engineering process), as there are many factors beyond hardware influencing indexing speed: the structure/complexity of your index (complex mappings, filters or analyzers), data types, whether your workload is I/O or CPU bound, and so on.
In any case, to demonstrate how variable it can be, I can share my experience, as it seems different from most posted here:
Elastic 5.6 with 10GB heap running on a single vServer with 16GB RAM, 4 vCPU and an SSD that averages 150 MB/s while searching.
I can successfully index documents of wildly varying sizes via the HTTP bulk API (curl) using a batch size of 10k documents (20k lines, file sizes between 25 MB and 79 MB), each batch taking ~90 seconds. index.refresh_interval is set to -1 during indexing, but that's about the only "tuning" I did; all other configurations are the defaults. I guess this is mostly because the index itself is not too complex.
The vServer sits at about 50% CPU, with the SSD averaging 40 MB/s and 4 GB of RAM free, so I could probably make it faster by sending two files in parallel (I tried simply increasing the batch size by 50% but started getting errors), but after that point it probably makes more sense to consider a different API or simply to spread the load over a cluster.
answered Nov 12 '18 at 22:15 by jmng, edited Nov 12 '18 at 22:24
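As a reference for the refresh_interval trick mentioned above, here is a minimal sketch of toggling it around a bulk load via the HTTP settings API using the Java 11+ HttpClient; the node URL http://localhost:9200 and the index name myindex are placeholder assumptions:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RefreshIntervalToggle {

    private static final HttpClient HTTP = HttpClient.newHttpClient();
    private static final String ES = "http://localhost:9200";    // assumed local, unsecured node

    // PUT /<index>/_settings with the given refresh_interval value
    static void setRefreshInterval(String index, String interval) throws Exception {
        String body = "{\"index\":{\"refresh_interval\":\"" + interval + "\"}}";
        HttpRequest request = HttpRequest.newBuilder(URI.create(ES + "/" + index + "/_settings"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }

    public static void main(String[] args) throws Exception {
        setRefreshInterval("myindex", "-1");   // disable refresh before the bulk load
        // ... run the bulk indexing job here ...
        setRefreshInterval("myindex", "1s");   // restore a normal refresh interval afterwards
    }
}
```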