Scrape with Scrapy using saved HTML pages
I'm looking for a way to use Scrapy with HTML pages that I saved on my computer. So far I get this error:

requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'

SPIDER_START_URLS = ["file:///home/stage/Guillaume/scraper_test/mypage/details.html"]

Tags: html, web-scraping, scrapy, local, scrapy-spider
asked Nov 9 at 10:03 by Ayra
1. Unless I'm very much mistaken, Scrapy has supported the file: scheme for quite a long time. 2. According to the log you shared, that error looks like something generated by requests, the well-known HTTP client library, not by Scrapy. – starrify, Nov 9 at 10:14
For now I really don't know; since I'm new to Scrapy, I won't lose any more time on this and will use a static server instead. – Ayra, Nov 9 at 10:36
Sorry for not making myself clear. I meant that you probably need to provide further information (more lines of the log? some related code?) before others can dig further and help. – starrify, Nov 9 at 11:12
Full log:
Unhandled error in Deferred: 2018-11-09 13:05:25 [twisted] CRITICAL: Traceback (most recent call last):
  File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
    result = g.send(result)
  File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/scrapy/crawler.py", line 82, in crawl
    yield self.engine.open_spider(self.spider, start_requests)
requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'
– Ayra, Nov 9 at 12:07
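As starrify points out, the traceback comes from requests rather than from Scrapy's own downloader, which does handle file:// URLs. For reference, a minimal sketch of a spider that crawls the saved page directly; the spider name and the parse logic are illustrative, not from the question:

import scrapy

class LocalPageSpider(scrapy.Spider):
    # illustrative name; any spider name works
    name = "local_page"
    # the file:// URL from the question
    start_urls = ["file:///home/stage/Guillaume/scraper_test/mypage/details.html"]

    def parse(self, response):
        # Scrapy serves file:// URLs through its file download handler,
        # so no HTTP server is needed for pages saved on disk
        yield {"title": response.css("title::text").extract_first()}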
add a comment |
1
1. Unless I'm being very mistaken, Scrapy has been supporting thefile:
scheme for quite long. 2. According to the log you shared, it looks like something generated byrequests
the famous HTTP client library, not Scrapy.
– starrify
Nov 9 at 10:14
For now i really don't know and as I'm new to scrappy i will not loose any time and use a static server
– Ayra
Nov 9 at 10:36
Sorry for not having made myself clear. I thought that you probably need to provide further information (more lines of log? some related code? etc.) before others could try digging further and helping.
– starrify
Nov 9 at 11:12
all log : Unhandled error in Deferred: 2018-11-09 13:05:25 [twisted] CRITICAL: Traceback (most recent call last): File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks result = g.send(result) File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/scrapy/crawler.py", line 82, in crawl yield self.engine.open_spider(self.spider, start_requests) requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'
– Ayra
Nov 9 at 12:07
1
1
1. Unless I'm being very mistaken, Scrapy has been supporting the
file:
scheme for quite long. 2. According to the log you shared, it looks like something generated by requests
the famous HTTP client library, not Scrapy.– starrify
Nov 9 at 10:14
1. Unless I'm being very mistaken, Scrapy has been supporting the
file:
scheme for quite long. 2. According to the log you shared, it looks like something generated by requests
the famous HTTP client library, not Scrapy.– starrify
Nov 9 at 10:14
For now i really don't know and as I'm new to scrappy i will not loose any time and use a static server
– Ayra
Nov 9 at 10:36
For now i really don't know and as I'm new to scrappy i will not loose any time and use a static server
– Ayra
Nov 9 at 10:36
Sorry for not having made myself clear. I thought that you probably need to provide further information (more lines of log? some related code? etc.) before others could try digging further and helping.
– starrify
Nov 9 at 11:12
Sorry for not having made myself clear. I thought that you probably need to provide further information (more lines of log? some related code? etc.) before others could try digging further and helping.
– starrify
Nov 9 at 11:12
all log : Unhandled error in Deferred: 2018-11-09 13:05:25 [twisted] CRITICAL: Traceback (most recent call last): File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks result = g.send(result) File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/scrapy/crawler.py", line 82, in crawl yield self.engine.open_spider(self.spider, start_requests) requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'
– Ayra
Nov 9 at 12:07
all log : Unhandled error in Deferred: 2018-11-09 13:05:25 [twisted] CRITICAL: Traceback (most recent call last): File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks result = g.send(result) File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/scrapy/crawler.py", line 82, in crawl yield self.engine.open_spider(self.spider, start_requests) requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'
– Ayra
Nov 9 at 12:07
add a comment |
1 Answer
I have had great success using request_fingerprint to inject existing HTML files into HTTPCACHE_DIR (which is almost always .scrapy/httpcache/${spider_name}). Then turn on Scrapy's HTTP cache middleware, which defaults to file-based cache storage, together with the "dummy policy", which treats the on-disk file as authoritative and won't make a network request when it finds the URL in the cache.
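To see where a given page would land in the cache, the fingerprint (and therefore the directory name) can be computed up front. A small sketch following the layout used by the filesystem cache backend, with an example.com URL standing in for the real one:

from scrapy.http import Request
from scrapy.utils.request import request_fingerprint

# hypothetical URL, used only to show the cache layout
fp = request_fingerprint(Request(url="http://example.com/mypage/details.html"))
# FilesystemCacheStorage stores each entry under
#   <HTTPCACHE_DIR>/<spider_name>/<fp[:2]>/<fp>/
# as response_body, response_headers, meta and pickled_meta files
print(fp)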
I would expect the script to look something like this (this is just the general idea, and not guaranteed to even run):
import sys

from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapy.http import Request, HtmlResponse
from scrapy.settings import Settings

# this value is the actual URL from which the on-disk file was saved,
# not the "file://" version
url = sys.argv[1]
html_filename = sys.argv[2]

# read the saved page as bytes, since Response bodies are bytes
with open(html_filename, 'rb') as fh:
    html_bytes = fh.read()

# build the request/response pair the cache entry will represent
req = Request(url=url)
resp = HtmlResponse(url=req.url, body=html_bytes, encoding='utf-8', request=req)

# the filesystem storage derives the cache path from the request fingerprint
# and the spider's name, so pass an instance of your actual Spider class
settings = Settings()
cache = FilesystemCacheStorage(settings)
spider = None  # fill in your Spider instance here
cache.store_response(spider, req, resp)
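For the "turn on the cache" half of this approach, the switches live in the project's settings.py. A minimal sketch using Scrapy's standard httpcache settings; only HTTPCACHE_ENABLED actually changes from the defaults, the other lines just make the choice of policy and storage explicit:

# settings.py -- serve the injected responses from the on-disk cache
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'   # resolved under the project's .scrapy/ directory
# the dummy policy treats every cached response as fresh, so no network request is made
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'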
answered Nov 10 at 21:24 by Matthew L Daniel