Scrapy - Simple div[@class] response.xpath attribute not returning data

I have written some Scrapy code to collect links from an Indeed search results page. My start URL is an HTTP address that returns a list of job ads, and I am trying to scrape the URL and the job title for each job shown on the page. My problem appears to be the titles = response.xpath selector. If I use a selector specific to one job, I get data, but when I use the selector shown in my code below I get nothing (not even the column headers), even though that selector appears to encompass everything I need. Any help welcomed, as I am just a beginner.

I'm outputting to a CSV file and I've used this code successfully elsewhere, so I'm wondering if it is something about the way they have coded the target URL page. It's driving me nuts!

from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom4.items import Scrape4Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
import requests


class MySpider(Spider):
    name = "Scrape4"
    allowed_domains = ["indeed.co.uk"]

    start_urls = ['http://www.indeed.co.uk/jobs?as_and=a&as_phr=&as_any=&as_not=IT+construction&as_ttl=Project+Manager&as_cmp=&jt=contract&st=&salary=%C2%A310K-%C2%A3999K&radius=25&l=&fromage=2&limit=50&sort=date&psf=advsrch',]

    def parse(self, response):
        titles = response.xpath('//div[@class="jobsearch-SerpJobCard row result clickcard"]')

        items = []
        for titles in titles:
            item = Scrape4Item()
            base_url = get_base_url(response)
            home_url = ("http://www.indeed.co.uk")
            item ['_pageURL'] = base_url
            item ['role_titletext'] = titles.xpath('//h2/a/text()').extract()

            items.append(item)
        return items

Thanks for the guidance, Elena, but I'm afraid your suggestions made no difference: I still get no data returned. I have resolved the duplicate variable (for titles in titles1), which I tested satisfactorily as a standalone change, but the other suggestions did not help. I also tried running the scrape with just the request for a URL to be returned, and it still didn't work. The revised example is below.

from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom4.items import Scrape4Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
import requests


class MySpider(Spider):
    name = "Scrape4"
    allowed_domains = ["indeed.co.uk"]

    start_urls = ['http://www.indeed.co.uk/jobs?as_and=a&as_phr=&as_any=&as_not=IT+construction&as_ttl=Project+Manager&as_cmp=&jt=contract&st=&salary=%C2%A310K-%C2%A3999K&radius=25&l=&fromage=2&limit=50&sort=date&psf=advsrch',]

    def parse(self, response):
        titles1 = response.css('div.jobsearch-SerpJobCard.row.result.clickcard')
        #also tried as titles = response.css('div.jobsearch-SerpJobCard row result clickcard')

        items = []
        for titles in titles1:
            item = Scrape4Item()
            base_url = get_base_url(response)
            home_url = ("http://www.indeed.co.uk")
            item ['_pageURL'] = base_url
            item ['role_titletext'] = titles.xpath('.//h2/a/text()').extract()
            #also tried as item ['role_titletext'] = titles.css('h2 a::text').extract()

            items.append(item)
        return items

EDIT:
Thank you, Thiago. That's cracked it! You're a superstar!
Thanks to you and Elena for having patience with a newbie.
Just to complete the circle for anybody else, the final code that worked for me is below. It returns the search page URL and the job title :-)

from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom4.items import Scrape4Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
import requests


class MySpider(Spider):
    name = "Scrape4"
    allowed_domains = ["indeed.co.uk"]
    start_urls = ['http://www.indeed.co.uk/jobs?as_and=a&as_phr=&as_any=&as_not=IT+construction&as_ttl=Project+Manager&as_cmp=&jt=contract&st=&salary=%C2%A310K-%C2%A3999K&radius=25&l=&fromage=2&limit=50&sort=date&psf=advsrch',]

    def parse(self, response):
        # Match on the one class that is present in the downloaded HTML.
        titles = response.css('.jobsearch-SerpJobCard')
        items = []
        for title in titles:
            item = Scrape4Item()
            base_url = get_base_url(response)
            home_url = ("http://www.indeed.co.uk")
            item ['_pageURL'] = base_url
            # Relative XPath: read the title attribute of this card's link only.
            item ['role_titletext'] = title.xpath('.//h2/a/@title').extract()
            # Append inside the loop so every card is kept, not just the last one.
            items.append(item)
        return items

edited Jan 17 at 21:37

Thiago Curvelo

asked Nov 14 '18 at 21:47

Jamwg

Try response.css('div.jobsearch-SerpJobCard.row.result.clickcard') if you want to match on all of the classes, although you can use fewer of them. You also reuse the same variable name in for titles in titles:. The extraction is wrong as well: use .xpath('.//h2/a/text()').extract() or .css('h2 a::text').extract()

– vezunchik
Nov 15 '18 at 11:58
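
To illustrate the point about '//h2/a' versus './/h2/a', here is a minimal sketch. It uses Python's standard-library ElementTree as a stand-in for Scrapy's selectors so that it runs without installing anything, and the markup is a hypothetical, simplified page rather than Indeed's real HTML. ElementTree does not accept a leading '//' on an element, so the document-wide search is written against the root explicitly; in Scrapy, calling .xpath('//h2/a') on a card has the same document-wide effect.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified results page with two job cards.
PAGE = """
<html><body>
  <div class="jobsearch-SerpJobCard"><h2><a>Project Manager</a></h2></div>
  <div class="jobsearch-SerpJobCard"><h2><a>Programme Manager</a></h2></div>
</body></html>
""".strip()

root = ET.fromstring(PAGE)
cards = root.findall(".//div[@class='jobsearch-SerpJobCard']")

# Document-wide search: what an XPath starting with '//' does even when it is
# called on a single card. Every card ends up with every title on the page.
document_wide = [[a.text for a in root.findall(".//h2/a")] for _ in cards]

# Relative search ('.//h2/a'): each card yields only its own title.
per_card = [[a.text for a in card.findall(".//h2/a")] for card in cards]

print(document_wide)
print(per_card)
```

With the relative path, each item gets one title; with the document-wide path, each item would repeat the full list of titles.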

python scrapy

1 Answer

I noticed that there is no clickcard class in the downloaded HTML code, but it is there after the page loads, so it is presumably added by some JavaScript code.
As Scrapy doesn't execute JavaScript, you may want to double-check the page source (instead of using 'inspect element') when a selector fails unexpectedly.
Besides that, a shorter selector like '.jobsearch-SerpJobCard' would do the job.
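
As a rough sketch of why the original selector came back empty: an XPath test such as @class="a b c" compares the whole attribute string, so it stops matching as soon as one class is missing, whereas a CSS class selector only needs that one class to be present. The example below again uses the standard-library ElementTree in place of Scrapy's selectors, and the markup and the has_class helper are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup as downloaded, before any JavaScript runs:
# the cards have no "clickcard" class yet.
RAW_HTML = """
<html><body>
  <div class="jobsearch-SerpJobCard row result">
    <h2><a title="Project Manager" href="/job/1">Project Manager</a></h2>
  </div>
  <div class="jobsearch-SerpJobCard row result">
    <h2><a title="Senior Project Manager" href="/job/2">Senior Project Manager</a></h2>
  </div>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Exact comparison against the full class string: nothing matches,
# because "clickcard" is absent from the downloaded HTML.
exact = root.findall(".//div[@class='jobsearch-SerpJobCard row result clickcard']")


def has_class(element, name):
    """Hypothetical helper: True if `name` is one of the element's classes."""
    return name in element.get("class", "").split()


# Matching on a single class (what '.jobsearch-SerpJobCard' does in CSS).
cards = [div for div in root.iter("div") if has_class(div, "jobsearch-SerpJobCard")]
titles = [card.find("./h2/a").get("title") for card in cards]

print(len(exact), titles)
```

Matching on the single stable class keeps working whether or not the script-added class is there.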

Regarding the question in the title, to get an attribute's value you can use xpath('.//div/@class') or css('div::attr(class)'). E.g.:

def parse(self, response):
    titles = response.css('.jobsearch-SerpJobCard')
    for title in titles:
        item = {}
        item['role_titletext'] = title.xpath('.//h2/a/@title').get()
        # or
        # item['role_titletext'] = title.css('h2 a::attr(title)').get()
        yield item

edited Jan 17 at 21:38

answered Nov 16 '18 at 3:28

Thiago Curvelo
