Solr server keeps going down while indexing (millions of docs) using Pysolr
I've been trying to index a large number of documents in Solr (~200 million docs), using Pysolr to do the indexing. However, the Solr server keeps going down while indexing (sometimes after ~100 million documents have been indexed, sometimes after ~180 million; it varies).
I'm not sure why this is happening. Is it because of the open file limit, i.e., related to the warning I get when starting the server with bin/solr start?
*** [WARN] *** Your open file limit is currently 1024. It should be set to 65000 to avoid operational disruption.
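For reference, the limits a process inherits can be checked from Python's standard resource module (a quick sketch; the printed values are just what I'd expect on a stock Ubuntu install, and permanently raising the hard limit has to happen at the OS level, e.g. in /etc/security/limits.conf):

    import resource

    # Soft and hard caps on open file descriptors for the current process
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(soft, hard)  # e.g. 1024 4096 on a default Ubuntu setup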
I index with multiprocessing, in chunks of 25,000 records (though I also tried bigger chunks and no multiprocessing, and it still crashed). Is it because too many requests are being sent to Solr? My Python code is below.
    import concurrent.futures
    import csv
    import os
    from glob import glob

    import pysolr

    solr = pysolr.Solr('http://localhost:8983/solr/collection_name', always_commit=True)

    def insert_into_solr(filepath):
        """Inserts records into an empty Solr index which has already been created."""
        record_number = 0
        list_for_solr = []
        with open(filepath, "r") as file:
            # Strip NUL bytes so the csv module doesn't choke on them
            csv_reader = csv.reader((line.replace('\0', '') for line in file),
                                    delimiter='\t', quoting=csv.QUOTE_NONE)
            for paper_id, paper_reference_id, context in csv_reader:
                # int, int, string
                record_number += 1
                solr_record = {}
                solr_record['paper_id'] = paper_id
                solr_record['reference_id'] = paper_reference_id
                solr_record['context'] = context
                # Send a chunk of 25000 records per request
                if record_number % 25000 == 0:
                    list_for_solr.append(solr_record)
                    try:
                        solr.add(list_for_solr)
                    except Exception as e:
                        print(e, record_number, filepath)
                    list_for_solr = []
                    print(record_number)
                else:
                    list_for_solr.append(solr_record)
            # Flush whatever is left after the last full chunk
            try:
                solr.add(list_for_solr)
            except Exception as e:
                print(e, record_number, filepath)

    def create_concurrent_futures():
        """Uses all the cores to do the parsing and inserting."""
        folderpath = '.../'
        refs_files = glob(os.path.join(folderpath, '*.txt'))
        with concurrent.futures.ProcessPoolExecutor() as executor:
            executor.map(insert_into_solr, refs_files, chunksize=1)

    if __name__ == '__main__':
        create_concurrent_futures()
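I'm aware that always_commit=True makes every solr.add call trigger a hard commit, which is expensive at this volume. A lighter-weight variant I could try (a sketch, assuming pysolr's commit and commitWithin parameters and the same collection name) would defer commits to Solr itself:

    # Sketch: no hard commit per request; ask Solr to commit within 60 seconds
    solr = pysolr.Solr('http://localhost:8983/solr/collection_name')
    solr.add(list_for_solr, commit=False, commitWithin='60000')  # value in milliseconds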
I read somewhere that a standard Solr installation has a hard limit of around 2.14 billion documents. Is it better to use SolrCloud (which I have never configured) when there are hundreds of millions of docs? Will it help with this problem? (I also have another file with 1.4 billion documents which needs to be indexed after this.) I have only one server; is there any point in trying to configure SolrCloud?
python ubuntu unix solr pysolr
asked Nov 15 '18 at 19:34 by ash
An easy test is to change the ulimit and see if it helps - see 'File handles and processes - ulimit settings' for information. Using SolrCloud is helpful when you want to spread the set of documents across multiple servers. The 2.1b limit is per shard/core, so using a collection in SolrCloud with multiple servers (even if they're running on a single machine but with different working directories) will allow you to scale that up further.
– MatsLindh
Nov 15 '18 at 20:20
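(A quick way to run that test, assuming the default bin/solr layout: the soft limit only applies per shell, so raise it in the same shell Solr is started from, before making a permanent limits.conf change.)

    ulimit -n 65000   # raise the soft open-file limit for this shell (may require the hard limit to allow it)
    bin/solr restart  # restart Solr from the same shell so it inherits the new limit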
Thanks @MatsLindh. I wanted to find out before asking the sysadmin to increase the ulimit. I set up Solr using the method in 'Getting started' (i.e., I just extracted it) rather than following the process on the 'Take Solr to production' page. I suppose using the default configs might also be contributing to this issue? So from what I understand, it's probably best to configure and use SolrCloud when there are so many documents, isn't it?
– ash
Nov 15 '18 at 21:10
That depends. It can be - if the amount of queries or the total number of documents requires it. It's also easier to scale in the future if necessary, but for prototyping and hosting something for data exploration with a small number of users, it's probably not needed.
– MatsLindh
Nov 16 '18 at 9:14
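(If it does come to that, a minimal sketch of a sharded local setup with the stock Solr tooling - the shard count here is arbitrary:)

    bin/solr start -c                                # start Solr in SolrCloud mode with embedded ZooKeeper
    bin/solr create -c collection_name -s 2 -rf 1    # 2 shards, 1 replica each; each shard has its own ~2.14b cap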
Thanks @MatsLindh for the advice. That's very helpful.
– ash
Nov 17 '18 at 2:40