Need to scrape the data using BeautifulSoup
I need to get the celebrity details from https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php
Input: time of birth known only, all professions except world events, which returns roughly 22,822 celebrities. I am able to get the first page's data using urllib2 and bs4:
import re
import urllib2
from bs4 import BeautifulSoup

url = "https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php"
data = "sexe=M|F&categorie[0]=0|1|2|3|4|5|6|7|8|9|10|11|12&connue=1&pays=-1&tri=0&x=33&y=13"
fp = urllib2.urlopen(url, data)
soup = BeautifulSoup(fp, 'html.parser')

from_div = soup.find_all('div', attrs={'class': 'titreFiche'})
for major in from_div:
    name = re.findall(r'portrait">(.*?)<br/>', str(major))
    link = re.findall(r'<a href="(.*?)"', str(major))
    print name[0], link[0]
For the next 230 pages, I am unable to get the data. I tried changing the URL to page=2, page=3, and so on to the end, but the scrape returns nothing. Is there any way to get the remaining data from that page?
python-2.7 web-scraping beautifulsoup
Who gave me a negative score? Please tell me why.
– Aravindh Thirumaran
Nov 14 '18 at 6:27
asked Nov 13 '18 at 14:11 by Aravindh Thirumaran
edited Nov 13 '18 at 14:15 by ewwink
1 Answer
You need session cookies; use requests to handle the session easily:
from bs4 import BeautifulSoup
import requests, re

url = "https://www.astrotheme.com/celestar/horoscope_celebrity_search_by_filters.php"
searchData = {
    "sexe": "M|F",
    "categorie[0]": "0|1|2|3|4|5|6|7|8|9|10|11|12",
    "connue": 1, "pays": -1, "tri": 0, "x": 33, "y": 13
}
session = requests.session()

def doSearch(url, data=None):
    if data:
        fp = session.post(url, data=data).text
    else:
        fp = session.get(url).text
    soup = BeautifulSoup(fp, 'html.parser')
    from_div = soup.find_all('div', attrs={'class': 'titreFiche'})
    for major in from_div:
        name = re.findall(r'portrait">(.*?)<br/>', str(major))
        link = re.findall(r'<a href="(.*?)"', str(major))
        print name[0], link[0]

# do the POST search in the first request
doSearch(url, searchData)
# the session cookie is now set, so we can use GET requests for the next pages
for index in range(2, 4):  # get pages 2 to 3
    print('getting page: %s' % index)
    pageurl = '%s?page=%s' % (url, index)
    print(pageurl)
    doSearch(pageurl)
Thanks a lot, man. You are awesome. It worked fine for me. Thanks again!
– Aravindh Thirumaran
Nov 14 '18 at 5:49
Why did my question get -1 in the score? What is the problem with my question?
– Aravindh Thirumaran
Nov 14 '18 at 5:51
I don't know, but I didn't downvote your question. You're welcome.
– ewwink
Nov 14 '18 at 8:22
Boss, I am getting a connection error on the line fp = session.get(url).text: raise ConnectionError(err, request=request) requests.exceptions.ConnectionError: ('Connection aborted.', BadStatusLine("''",)). I googled it but could not resolve it; some suggest changing https to http, another says to add a header. I tried both ways but it still fails. Is there any other way? Also, you fetched only 2 pages; I want to go all the way to the last page. Did it work fine for you for the next 200 pages?
– Aravindh Thirumaran
Nov 14 '18 at 11:34
It worked, for example with for index in range(200, 204):. It could be the server dropping connections because your requests are too fast; try adding a sleep between requests.
– ewwink
Nov 14 '18 at 12:45
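Building on ewwink's suggestion, the full crawl can be wrapped with a delay and a simple retry to ride out dropped connections. This is only a sketch: the page_url and fetch_with_retry helpers, and the retry/delay values, are illustrative additions, not part of the original answer.

```python
import time

def page_url(base, index):
    # Build the paginated URL, assuming the ?page=N scheme from the answer above.
    return '%s?page=%s' % (base, index)

def fetch_with_retry(fetch, retries=3, delay=2.0, sleep=time.sleep):
    # Call fetch(); on failure, wait a little longer each time and retry.
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            sleep(delay * (attempt + 1))
```

Usage: after the initial doSearch(url, searchData) POST, loop over the remaining pages with something like fetch_with_retry(lambda: doSearch(page_url(url, index))) for index in range(2, 231), and add a time.sleep(1) between pages so the server is not hammered.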
answered Nov 13 '18 at 15:09 by ewwink