Improve Web Scraping for Elements in a Container Using Selenium


I am using Firefox, and my code works just fine, except that it is very slow. I prevent images from loading, just to speed things up a little bit:



firefox_profile = webdriver.FirefoxProfile()
firefox_profile.set_preference('permissions.default.image', 2)  # 2 = block all image loading
firefox_profile.set_preference('dom.ipc.plugins.enabled.libflashplayer.so', 'false')  # disable the Flash plugin
firefox_profile.set_preference("browser.privatebrowsing.autostart", True)  # always start in private browsing
driver = webdriver.Firefox(firefox_profile=firefox_profile)


but the performance is still slow. I have tried going headless, but unfortunately that did not work: I receive NoSuchElement errors. So, is there any way to speed up Selenium web scraping? I can't use Scrapy, because this is a dynamic scrape: I need to click through the "next" button several times, until no clickable button remains, and I need to click pop-up buttons as well.



Here is a snippet of the code:



import time

from selenium.common.exceptions import (ElementClickInterceptedException,
                                        NoSuchElementException)

a = []
b = []
c = []
d = []
e = []
f = []
while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        time.sleep(2)
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        time.sleep(2)
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for j in B:
            b.append(j.text)
        time.sleep(3)
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for k in C:
            c.append(k.text)
        time.sleep(3)
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for l in D:
            d.append(l.text)
        time.sleep(3)
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for m in E:
            e.append(m.text)

    try:
        time.sleep(2)
        # advance to the next page of reviews
        next_button = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next_button.click()
        time.sleep(2)
        # dismiss the pop-up if it appears
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        break


Here is an edited version, but speed does not improve.



from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

while True:
    container = driver.find_elements_by_xpath('.//*[contains(@class,"review-container")]')
    for item in container:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"ui_bubble_rating bubble_")]')))
        A = item.find_elements_by_xpath('.//*[contains(@class,"ui_bubble_rating bubble_")]')
        for i in A:
            a.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"recommend-titleInline noRatings")]')))
        B = item.find_elements_by_xpath('.//*[contains(@class,"recommend-titleInline noRatings")]')
        for i in B:
            b.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"noQuotes")]')))
        C = item.find_elements_by_xpath('.//*[contains(@class,"noQuotes")]')
        for i in C:
            c.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"ratingDate")]')))
        D = item.find_elements_by_xpath('.//*[contains(@class,"ratingDate")]')
        for i in D:
            d.append(i.text)
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"partial_entry")]')))
        E = item.find_elements_by_xpath('.//*[contains(@class,"partial_entry")]')
        for i in E:
            e.append(i.text)

    try:
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"nav next taLnk ui_button primary")]')))
        next_button = driver.find_element_by_xpath('.//*[contains(@class,"nav next taLnk ui_button primary")]')
        next_button.click()
        WebDriverWait(driver, 50).until(EC.element_to_be_clickable(
            (By.XPATH, './/*[contains(@class,"taLnk ulBlueLinks")]')))
        driver.find_element_by_xpath('.//*[contains(@class,"taLnk ulBlueLinks")]').click()
    except (ElementClickInterceptedException, NoSuchElementException):
        break

python selenium firefox web-scraping scrapy






asked Nov 13 '18 at 9:50 by Touya D. Serdan, edited Nov 16 '18 at 2:22








  • You have 17 seconds of sleep in each iteration of the while loop. Do you think that might have something to do with it?

    – Guy
    Nov 13 '18 at 9:53








  • Consider using waits instead of numerous sleeps to decrease execution time. Also note that if you are web scraping, you should use Selenium only as a last resort. You can often get the required data with a direct API call using, for instance, the requests lib.

    – Andersson
    Nov 13 '18 at 9:53













  • @Guy, I suspect the same thing. I am looking for a more optimized way to scrape text in a container that has a next button and an annoying pop-up.

    – Touya D. Serdan
    Nov 15 '18 at 2:22











  • A couple of things, although I'm not sure they will make much difference. First, you may not need all of those waits in the for loop if the presence of one element guarantees the others; e.g., clicking gives you a new row, with all the elements of that row present. Also, the wait's until call returns the elements you are waiting for, so there is no need for another call to fetch them. Finally, I think each call gathers all the matching elements again, given the XPath, so as written your lists may follow a 1,1,2,1,2,3 kind of pattern. See the sketch after these comments.

    – Biswanath
    Nov 16 '18 at 12:31
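A rough sketch of the restructuring Biswanath describes: one wait per page instead of one per item, reusing the elements the wait returns, and scoping every lookup to its container so nothing is gathered twice. The XPaths are the ones from the question; the timeout and everything else are illustrative, not a tested solution.

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait = WebDriverWait(driver, 10)
# One wait per page load; until() returns the elements it located.
containers = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, '//*[contains(@class,"review-container")]')))
for item in containers:
    # Relative XPaths keep each lookup scoped to this one review.
    a.extend(el.text for el in item.find_elements_by_xpath(
        './/*[contains(@class,"ui_bubble_rating bubble_")]'))
    e.extend(el.text for el in item.find_elements_by_xpath(
        './/*[contains(@class,"partial_entry")]'))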
1 Answer














For dynamic web pages (pages rendered or augmented with JavaScript), I suggest you use scrapy-splash.



Not that you can't use Selenium, but for scraping purposes scrapy-splash is a better fit.
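A minimal sketch of what that looks like, assuming a Splash instance is running (for example via the scrapinghub/splash Docker image) and scrapy-splash is wired into settings.py; the URL and selectors here are placeholders, not the real site's:

import scrapy
from scrapy_splash import SplashRequest

class ReviewSpider(scrapy.Spider):
    name = 'reviews'

    def start_requests(self):
        # SplashRequest renders the page (JavaScript included) before parsing.
        yield SplashRequest('https://example.com/reviews', self.parse,
                            args={'wait': 1})

    def parse(self, response):
        # response holds the rendered HTML, so dynamic content is selectable.
        for review in response.css('.review-container'):
            yield {'text': review.css('.partial_entry::text').extract_first()}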



Also, if you have to use Selenium for scraping, it is a good idea to run headless. You can also use Chrome; some benchmarks I ran a while back had headless Chrome faster than headless Firefox.
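For reference, a minimal headless Chrome setup (standard Selenium options, nothing site-specific; the window size is an assumption that often helps avoid layout-dependent NoSuchElement errors in headless mode):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--window-size=1920,1080')  # desktop-sized viewport
driver = webdriver.Chrome(options=options)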



Also, rather than sleep, it is better to use WebDriverWait with an expected condition: it waits only as long as necessary, whereas a thread sleep makes you wait the full stated time on every iteration.
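For example, the fixed sleep before the "next" click could become something like this (the XPath is taken from the question; the 10-second timeout is illustrative):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Returns the button as soon as it is clickable, up to a 10-second ceiling,
# instead of always paying the fixed cost of time.sleep().
next_button = WebDriverWait(driver, 10).until(EC.element_to_be_clickable(
    (By.XPATH, '//*[contains(@class,"nav next taLnk ui_button primary")]')))
next_button.click()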



Edit: adding this as an edit while trying to answer @QHarr, as the answer is pretty long.



It is a suggestion to evaluate scrapy-splash.



I gravitate towards Scrapy because of the whole ecosystem around scraping: middleware, proxies, deployment, scheduling, scaling. Basically, if you are doing any serious scraping, Scrapy is probably the better starting position. So that suggestion comes with a caveat.



As for speed, I can't give an objective answer, as I have never contrasted and benchmarked Scrapy against Selenium time-wise on a project of any size.



But I would assume you can get more or less comparable times on a serial run if you are doing the same things, since in most cases the time is spent waiting for responses.



If you are scraping any considerable number of items, the speed-up generally comes from parallelising the requests, and in some cases from falling back to a basic HTTP request/response where rendering the page in a user agent is not necessary.



Also, anecdotally, some in-page actions can be performed with the underlying HTTP request/response directly. So if time is a priority, you should look to get as much as possible done with plain HTTP requests and responses.
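A sketch of that fallback, assuming the reviews turn out to be served from a paginated endpoint; the URL, parameter, and selectors here are hypothetical, and the real ones would come from the browser's network tab:

import requests
from bs4 import BeautifulSoup

session = requests.Session()
offset = 0
texts = []
while True:
    # Hypothetical paginated endpoint standing in for the real XHR call.
    resp = session.get('https://example.com/reviews', params={'offset': offset})
    soup = BeautifulSoup(resp.text, 'html.parser')
    entries = soup.select('.partial_entry')
    if not entries:
        break  # no more pages: mirrors the "no next button" stop condition
    texts.extend(entry.get_text(strip=True) for entry in entries)
    offset += len(entries)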
answered Nov 14 '18 at 7:24 by Biswanath, edited Nov 16 '18 at 8:08













  • Why is scrapy-splash a better fit, please? Is it faster?

    – QHarr
    Nov 14 '18 at 7:46











  • Thanks for the suggestion, but going headless gave me NoSuchElement errors. I tried to solve the problem by replicating the solutions people have provided here on Stack Overflow, but to no avail, so I have reverted to being "not" headless.

    – Touya D. Serdan
    Nov 15 '18 at 2:08













  • Interesting. If you can provide a URL or a workable example, I can have a look, or you can post it as a question so that other community members can look. I have seen this NoSuchElementException issue in headless mode for a couple of different reasons.

    – Biswanath
    Nov 15 '18 at 5:19