Scrapy - Simple div[@class] response.xpath attribute not returning data
I have written some Scrapy code to obtain HTML links from Indeed search page results. My start URL is an HTTP address that provides a list of job ads, and I am trying to scrape the URL and the job title for each job shown on the page. My problem appears to be with the titles = response.xpath(...) line: if I use a job-specific attribute I get data, but when I use the attribute shown in my code below I get nothing (not even the column headers), despite the fact that the attribute encompasses everything I need. Any help is welcome, as I am just a beginner.
I'm outputting to a CSV file and I've used this code successfully elsewhere, so I'm wondering if it is something about the way the target page has been coded. It's driving me nuts!
from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom4.items import Scrape4Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
import requests


class MySpider(Spider):
    name = "Scrape4"
    allowed_domains = ["indeed.co.uk"]
    start_urls = ['http://www.indeed.co.uk/jobs?as_and=a&as_phr=&as_any=&as_not=IT+construction&as_ttl=Project+Manager&as_cmp=&jt=contract&st=&salary=%C2%A310K-%C2%A3999K&radius=25&l=&fromage=2&limit=50&sort=date&psf=advsrch',]

    def parse(self, response):
        titles = response.xpath('//div[@class="jobsearch-SerpJobCard row result clickcard"]')
        items = []
        for titles in titles:
            item = Scrape4Item()
            base_url = get_base_url(response)
            home_url = ("http://www.indeed.co.uk")
            item['_pageURL'] = base_url
            item['role_titletext'] = titles.xpath('//h2/a/text()').extract()
            items.append(item)
        return items
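For context, the Scrape4Item imported above is not shown in the question. A minimal sketch of what ICcom4/items.py presumably contains, with the field names inferred from the keys used in the spider, would be:

# ICcom4/items.py -- assumed definition (not shown in the original question);
# field names are inferred from the item keys used in the spider above.
import scrapy

class Scrape4Item(scrapy.Item):
    _pageURL = scrapy.Field()
    role_titletext = scrapy.Field()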
Thanks for the guidance Elena, but I'm afraid that your suggestions made no difference; I still get no data returned. I have resolved the duplicate variable (for titles in titles1), which I tested satisfactorily as a standalone change. However, the other suggestions made no difference. I also tried running the scrape with just a request for the URL to be returned, and it still didn't work. The revised example is below.
from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom4.items import Scrape4Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
import requests


class MySpider(Spider):
    name = "Scrape4"
    allowed_domains = ["indeed.co.uk"]
    start_urls = ['http://www.indeed.co.uk/jobs?as_and=a&as_phr=&as_any=&as_not=IT+construction&as_ttl=Project+Manager&as_cmp=&jt=contract&st=&salary=%C2%A310K-%C2%A3999K&radius=25&l=&fromage=2&limit=50&sort=date&psf=advsrch',]

    def parse(self, response):
        titles1 = response.css('div.jobsearch-SerpJobCard.row.result.clickcard')
        # also tried as titles = response.css('div.jobsearch-SerpJobCard row result clickcard')
        items = []
        for titles in titles1:
            item = Scrape4Item()
            base_url = get_base_url(response)
            home_url = ("http://www.indeed.co.uk")
            item['_pageURL'] = base_url
            item['role_titletext'] = titles.xpath('.//h2/a/text()').extract()
            # also tried as item['role_titletext'] = titles.css('h2 a::text').extract()
            items.append(item)
        return items
EDIT:
Thank you, Thiago. That's cracked it! You're a superstar!
Thanks to you and Elena for having patience with a newbie.
Just to complete the circle for anybody else, the final code that worked for me is below. It returns the search page URL and the job title :-)
from scrapy.spiders import Spider
from scrapy.selector import Selector
from ICcom4.items import Scrape4Item
from scrapy.linkextractors import LinkExtractor
from scrapy.utils.response import get_base_url
from scrapy.spiders import CSVFeedSpider
import requests


class MySpider(Spider):
    name = "Scrape4"
    allowed_domains = ["indeed.co.uk"]
    start_urls = ['http://www.indeed.co.uk/jobs?as_and=a&as_phr=&as_any=&as_not=IT+construction&as_ttl=Project+Manager&as_cmp=&jt=contract&st=&salary=%C2%A310K-%C2%A3999K&radius=25&l=&fromage=2&limit=50&sort=date&psf=advsrch',]

    def parse(self, response):
        titles = response.css('.jobsearch-SerpJobCard')
        items = []
        for title in titles:
            item = Scrape4Item()
            base_url = get_base_url(response)
            home_url = ("http://www.indeed.co.uk")
            item['_pageURL'] = base_url
            item['role_titletext'] = title.xpath('.//h2/a/@title').extract()
            items.append(item)
        return items
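As an aside, since the original goal also included the URL for each job, the href on the same anchor can be extracted and made absolute with response.urljoin. A sketch only, assuming an extra job_url field is added to the item (it is not in the original code):

            # hypothetical extension of the loop above; 'job_url' is an assumed extra item field
            job_href = title.xpath('.//h2/a/@href').extract_first()
            if job_href:
                item['job_url'] = response.urljoin(job_href)  # make the relative href absolute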
Tags: python, scrapy
Try to use response.css('div.jobsearch-SerpJobCard.row.result.clickcard') if you want to use all of the classes, but you can use fewer of them. You also have a duplicate variable here: for titles in titles:. And the extraction is wrong; use .xpath('.//h2/a/text()').extract() or .css('h2 a::text').extract().
– vezunchik, Nov 15 '18 at 11:58
1 Answer
I noticed that there is no clickcard class in the downloaded HTML code, but it is there after the page loads, so it is most likely added by some JavaScript code.
As Scrapy doesn't execute JavaScript, you may want to double-check the page source (rather than 'inspect element') when a selector fails unexpectedly.
Besides that, a shorter selector like '.jobsearch-SerpJobCard' would do the job.
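One quick way to confirm this is to inspect the raw downloaded HTML from scrapy shell rather than the browser's DOM. A sketch, run against the start URL:

# inside `scrapy shell '<start URL>'` -- response is what Scrapy actually downloaded
'clickcard' in response.text                    # expected False: the class is added client-side
len(response.css('.jobsearch-SerpJobCard'))     # the server-rendered cards are still selectable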
Regarding the question in the title, to get an attribute's data you may use xpath('.//div/@class') or css('div::attr(class)'). E.g.:
def parse(self, response):
    titles = response.css('.jobsearch-SerpJobCard')
    for title in titles:
        item = {}
        item['role_titletext'] = title.xpath('.//h2/a/@title').get()
        # or
        # item['role_titletext'] = title.css('h2 a::attr(title)').get()
        yield item
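(A follow-up note on the snippet above: .get() returns the first matching value as a string, or None if nothing matches, whereas .extract() returns a list of all matches. Either works here; .get() simply avoids single-element lists in the output.)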
– Thiago Curvelo, answered Nov 16 '18 at 3:28