Improve Web Scraping for Elements in a Container Using Selenium


I am using Firefox, and my code works just fine, except that it is very slow. I prevent images from loading, just to speed things up a little bit:



firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)  # 2 = block all image loading
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')  # disable the Flash plugin
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)  # always start in private browsing
driver = webdriver.Firefox(firefox_profile=firefox_profile)


but the performance is still slow. I have tried going headless, but unfortunately that did not work: I receive NoSuchElement errors. So, is there any way to speed up Selenium web scraping? I can't use Scrapy, because this is a dynamic scrape: I need to click through the "next" button several times, until no clickable button remains, and I need to click pop-up buttons as well.



Here is a snippet of the code:



import time

from selenium.common.exceptions import (ElementClickInterceptedException,
                                        NoSuchElementException)

a = []
b = []
c = []
d = []
e = []
f = []
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)

    try:
        time.sleep(2)
        # advance to the next page of reviews
        next_button = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next_button.click()
        time.sleep(2)
        # dismiss the pop-up if it appears
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        break


Here is an edited version, but speed does not improve.



from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)

    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"nav next taLnk ui_button primary")]')))
        next_button = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next_button.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        break

python selenium firefox web-scraping scrapy






asked Nov 13 '18 at 9:50 by Touya D. Serdan, edited Nov 16 '18 at 2:22








  • You have 17 seconds of sleep in each iteration of the while loop. Do you think that might have something to do with it?

    – Guy
    Nov 13 '18 at 9:53








  • Consider using waits instead of numerous sleeps to decrease execution time. Also note that if you are web scraping, you should use Selenium only as a last resort. You can often get the required data with a direct API call using, for instance, the requests lib.

    – Andersson
    Nov 13 '18 at 9:53













  • @Guy, I suspect the same thing. I am looking for a more optimized way to scrape text in a container that has a next button and an annoying pop-up.

    – Touya D. Serdan
    Nov 15 '18 at 2:22











  • A couple of things, although I'm not sure they will make much difference. First, you may not need all of those waits in the for loop if the presence of one element guarantees the others; e.g., clicking gives you a new row, with all the elements of that row present. Also, the wait's until call returns the elements you are waiting for, so there is no need for another call to fetch them. Finally, I think each call gathers all the matching elements again, given the XPath, so as written your lists may follow a 1,1,2,1,2,3 kind of pattern. See the sketch after these comments.

    – Biswanath
    Nov 16 '18 at 12:31
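A rough sketch of the restructuring Biswanath describes: one wait per page instead of one per item, reusing the elements the wait returns, and scoping every lookup to its container so nothing is gathered twice. The XPaths are the ones from the question; the timeout and everything else are illustrative, not a tested solution.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
# One wait per page load; until() returns the elements it located.
containers = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, '//*[contains(@class,"review-container")]')))
for item in containers:
    # Relative XPaths keep each lookup scoped to this one review.
    a.extend(el.text for el in item.find_elements_by_xpath(
        './/*[contains(@class,"ui_bubble_rating bubble_")]'))
    e.extend(el.text for el in item.find_elements_by_xpath(
        './/*[contains(@class,"partial_entry")]'))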
1 Answer














For dynamic web pages (pages rendered or augmented with JavaScript), I suggest you use scrapy-splash.



Not that you can't use Selenium, but for scraping purposes scrapy-splash is a better fit.
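A minimal sketch of what that looks like, assuming a Splash instance is running (for example via the scrapinghub/splash Docker image) and scrapy-splash is wired into settings.py; the URL and selectors here are placeholders, not the real site's:

import scrapy
from scrapy_splash import SplashRequest

class ReviewSpider(scrapy.Spider):
    name = 'reviews'

    def start_requests(self):
        # SplashRequest renders the page (JavaScript included) before parsing.
        yield SplashRequest('https://example.com/reviews', self.parse,
                            args={'wait': 1})

    def parse(self, response):
        # response holds the rendered HTML, so dynamic content is selectable.
        for review in response.css('.review-container'):
            yield {'text': review.css('.partial_entry::text').extract_first()}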



Also, if you have to use Selenium for scraping, it is a good idea to run headless. You can also use Chrome; some benchmarks I ran a while back had headless Chrome faster than headless Firefox.
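For reference, a minimal headless Chrome setup (standard Selenium options, nothing site-specific; the window size is an assumption that often helps avoid layout-dependent NoSuchElement errors in headless mode):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')  # desktop-sized viewport
driver = webdriver.Chrome(options=options)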



Also, rather than sleep, it is better to use WebDriverWait with an expected condition: it waits only as long as necessary, whereas a thread sleep makes you wait the full stated time on every iteration.
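For example, the fixed sleep before the "next" click could become something like this (the XPath is taken from the question; the 10-second timeout is illustrative):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Returns the button as soon as it is clickable, up to a 10-second ceiling,
# instead of always paying the fixed cost of time.sleep().
next_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.XPATH, '//*[contains(@class,"nav next taLnk ui_button primary")]')))
next_button.click()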



Edit: adding this as an edit while trying to answer @QHarr, as the answer is pretty long.



It is a suggestion to evaluate scrapy-splash.



I gravitate towards Scrapy because of the whole ecosystem around scraping: middleware, proxies, deployment, scheduling, scaling. Basically, if you are doing any serious scraping, Scrapy is probably the better starting position. So that suggestion comes with a caveat.



As for speed, I can't give an objective answer, as I have never contrasted and benchmarked Scrapy against Selenium time-wise on a project of any size.



But I would assume you can get more or less comparable times on a serial run if you are doing the same things, since in most cases the time is spent waiting for responses.



If you are scraping any considerable number of items, the speed-up generally comes from parallelising the requests, and in some cases from falling back to a basic HTTP request/response where rendering the page in a user agent is not necessary.



Also, anecdotally, some in-page actions can be performed with the underlying HTTP request/response directly. So if time is a priority, you should look to get as much as possible done with plain HTTP requests and responses.
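A sketch of that fallback, assuming the reviews turn out to be served from a paginated endpoint; the URL, parameter, and selectors here are hypothetical, and the real ones would come from the browser's network tab:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
offset = 0
texts = []
while True:
    # Hypothetical paginated endpoint standing in for the real XHR call.
    resp = session.get('https://example.com/reviews', params={'offset': offset})
    soup = BeautifulSoup(resp.text, 'html.parser')
    entries = soup.select('.partial_entry')
    if not entries:
        break  # no more pages: mirrors the "no next button" stop condition
    texts.extend(entry.get_text(strip=True) for entry in entries)
    offset += len(entries)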
answered Nov 14 '18 at 7:24 by Biswanath, edited Nov 16 '18 at 8:08













  • Why is scrapy-splash a better fit, please? Is it faster?

    – QHarr
    Nov 14 '18 at 7:46











  • Thanks for the suggestion, but going headless gave me NoSuchElement errors. I tried to solve the problem by replicating the solutions people have provided here on Stack Overflow, but to no avail, so I have reverted to being "not" headless.

    – Touya D. Serdan
    Nov 15 '18 at 2:08













  • Interesting. If you can provide a URL or a workable example, I can have a look, or you can post it as a question so that other community members can look. I have seen this NoSuchElementException issue in headless mode for a couple of different reasons.

    – Biswanath
    Nov 15 '18 at 5:19