Scrapy get text out of span











URL: https://myanimelist.net/anime/236/Es_Otherwise



I am trying to scrape the following content from that URL:



enter image description here



I tried:

for i in response.css('span[class = dark_text]'):
    i.xpath('/following-sibling::text()')

or the XPath expressions below, which don't work either, unless I missed something...



aired_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[11]/text()')

producer_xpath = response.xpath("//*[@id='content']/table/tbody/tr/td[1]/div/div[12]/span/a/@href/text()")
licensor_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[13]/a/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[14]/a/@href/title/text()')
studio_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[17]/text()')
str_rating_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[18]/text()')
ranked_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[20]/span/text()')
japanese_title_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[7]/text()')
source_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[15]/text()')
genre_xpath = [response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a[{0}]'.format(i)) for i in range(1,4)]
genre_xpath_v2 = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[16]/a/@href/text()')
number_of_users_rated_anime_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[19]/span[3]/text()')
popularity_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[21]/span/text()')
members_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[22]/span/text()')
favorite_xpath = response.xpath('//*[@id="content"]/table/tbody/tr/td[1]/div/div[23]/span/text()')


But I figured out that some of the text sits outside the span elements, so I would like to get that text, the part outside the span, with a CSS or XPath expression.
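The situation described above can be sketched on a small, made-up fragment. The markup and its values are assumptions modelled on the question (a span with class dark_text followed by bare text); they are not copied from the real page. To keep the sketch runnable without Scrapy, it uses the standard library's xml.etree.ElementTree, where the text that follows an element is exposed as that element's tail. In Scrapy, the corresponding relative XPath from an already selected span would be './following-sibling::text()' (note the leading dot, which keeps the expression relative to the span).

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the label is inside the span, the value is outside it.
SAMPLE = """
<div>
  <div><span class="dark_text">Episodes:</span> 13</div>
  <div><span class="dark_text">Source:</span> Manga</div>
</div>
"""


def text_after_spans(markup):
    """Map each dark_text label to the text node that directly follows its span."""
    root = ET.fromstring(markup.strip())
    result = {}
    for span in root.iter("span"):
        if span.get("class") != "dark_text":
            continue
        label = (span.text or "").strip().rstrip(":")
        # .tail is the text between the span's closing tag and the next tag.
        result[label] = (span.tail or "").strip()
    return result


print(text_after_spans(SAMPLE))
```

The helper name text_after_spans is only illustrative; the point is that the value is a sibling text node of the span, not a child of it.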







python html css scrapy














edited Nov 11 at 8:41
quant

asked Nov 10 at 15:10
user9176398












  • Hi. Please can you write a paragraph or so to better explain your question?
    – user
    Nov 10 at 15:17










  • What language do you want to use? Do you have a deal with that site to scrape the content?
    – bestprogrammerintheworld
    Nov 10 at 16:38










  • I use Python with the Scrapy framework.
    – user9176398
    Nov 10 at 20:50






























2 Answers
If you are only trying to scrape the information shown in the image, you can just use:



response.xpath('//div[@class="space-it"]//text()').extract()


Or perhaps I have not understood your question properly.

















answered Nov 10 at 17:18
Gaurav












  • That syntax returns an empty list.
    – user9176398
    Nov 10 at 20:49










  • Have you changed the class name? The class name is actually spaceit.
    – Gaurav
    Nov 11 at 15:19












  • For a better result, you can try response.xpath('//div[@class="js-scrollfix-bottom"]//div[@class="spaceit"]')
    – Gaurav
    Nov 11 at 15:41










  • It just won't return the alternative names and the type.
    – Gaurav
    Nov 11 at 15:43
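The exchange above comes down to an exact string match on the class attribute: a predicate such as @class="space-it" only matches elements whose attribute is exactly that string, so a misspelled name gives an empty result rather than an error. Below is a minimal sketch on made-up markup. The fragment, its values and the count_divs_with_class helper are assumptions for illustration, and the standard library's ElementTree stands in for Scrapy's selectors so that the sketch runs without extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two rows carry class "spaceit", one row has no class.
SAMPLE = """
<div>
  <div class="spaceit"><span class="dark_text">Aired:</span> first value</div>
  <div class="spaceit"><span class="dark_text">Rating:</span> second value</div>
  <div><span class="dark_text">Type:</span> third value</div>
</div>
"""


def count_divs_with_class(markup, class_name):
    """Count descendant divs whose class attribute equals class_name exactly."""
    root = ET.fromstring(markup.strip())
    return len(root.findall(".//div[@class='%s']" % class_name))


print(count_divs_with_class(SAMPLE, "spaceit"))   # matches the two classed rows
print(count_divs_with_class(SAMPLE, "space-it"))  # misspelled name, matches nothing
```

The third row in the sample has no class at all, which mirrors the last comment: rows without the spaceit class are not returned by a class-based selector.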


















It is simpler to just loop through the div elements inside the table:



from scrapy.selector import Selector

# htmlString is assumed to hold the page's HTML source
foundH2 = False
response = Selector(text=htmlString).xpath('//*[@id="content"]/table/tr/td[1]/div/*')

for resp in response:
    # name() is an XPath function that returns the element's tag name
    tagName = resp.xpath('name()').extract_first()
    if 'h2' == tagName:
        foundH2 = True
    if foundH2:
        # start adding 'info' after <h2>Alternative Titles</h2> found
        info = None
        if 'div' == tagName:
            for item in resp.xpath('.//text()').extract():
                # stop reading this div once the inline ad script text is reached
                if 'googletag.' in item:
                    break
                item = item.strip()
                if item and item != ',':
                    info = info + " " + item if info else item
        if info:
            print(info)


Just my opinion: BeautifulSoup is faster and better than Scrapy.

















answered Nov 10 at 21:19
ewwink












  • Thanks, it works, but what are name and googletag? Can you explain your code a bit, please?
    – user9176398
    Nov 11 at 9:03












  • googletag is in the div content that comes after Favorites: 27, and the loop stops once it is found.
    – ewwink
    Nov 11 at 9:04
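To make the two points in this exchange concrete: name() is a standard XPath function that returns an element's tag name, and the 'googletag.' check is a stop condition for the script text that, according to the answer, follows the last field. The sketch below replays the same control flow on made-up markup. It uses the standard library's ElementTree so that it runs without Scrapy, with child.tag playing the role of name() and itertext() the role of './/text()'. The fragment, the field values and the collect_info helper are all assumptions, not taken from the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the answer describes: fields follow an <h2>,
# and a div holding an inline ad script comes last.
SAMPLE = """
<div>
  <div>Text before the heading is ignored</div>
  <h2>Alternative Titles</h2>
  <div><span class="dark_text">Japanese:</span> title</div>
  <div><span class="dark_text">Genres:</span> <a>Action</a>, <a>Drama</a></div>
  <div><script>googletag.cmd.push(1);</script></div>
</div>
"""


def collect_info(markup):
    """Return one joined string per div that appears after the first <h2>."""
    rows = []
    found_h2 = False
    for child in ET.fromstring(markup.strip()):
        if child.tag == "h2":  # same test as name() == 'h2' in the answer
            found_h2 = True
        if not found_h2 or child.tag != "div":
            continue
        parts = []
        for item in child.itertext():  # every text node under this div
            if "googletag." in item:
                break  # leaves only this inner loop, as in the answer
            item = item.strip()
            if item and item != ",":
                parts.append(item)
        if parts:
            rows.append(" ".join(parts))
    return rows


print(collect_info(SAMPLE))
```

Note that the break only ends the inner loop over text nodes, so the script text is skipped while any divs that come after it would still be visited.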



















