Scrape with Scrapy using saved HTML pages











I'm looking for a way to use Scrapy with HTML pages that I saved on my computer. So far, I get this error:



requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'


SPIDER_START_URLS = ["file:///home/stage/Guillaume/scraper_test/mypage/details.html"]










html web-scraping scrapy local scrapy-spider

asked Nov 9 at 10:03
Ayra


















  • 1. Unless I'm very much mistaken, Scrapy has supported the file: scheme for quite a long time. 2. According to the log you shared, it looks like something generated by requests, the famous HTTP client library, not Scrapy.
    – starrify
    Nov 9 at 10:14










  • For now I really don't know, and since I'm new to Scrapy I won't lose any time on this and will just use a static server instead.
    – Ayra
    Nov 9 at 10:36










  • Sorry for not making myself clear. I think you probably need to provide further information (more lines of the log? some related code? etc.) before others can dig further and help.
    – starrify
    Nov 9 at 11:12










  • Full log:
    Unhandled error in Deferred:
    2018-11-09 13:05:25 [twisted] CRITICAL: Traceback (most recent call last):
      File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
        result = g.send(result)
      File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/scrapy/crawler.py", line 82, in crawl
        yield self.engine.open_spider(self.spider, start_requests)
    requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'
    – Ayra
    Nov 9 at 12:07
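
As the first comment points out, the InvalidSchema error in that log comes from the requests library, which has no connection adapter for file:// URLs, whereas Scrapy's own downloader does handle the file: scheme. A minimal sketch of a spider that reads the saved page directly (the class name, spider name and selector below are illustrative, not taken from the question):

import scrapy

class DetailsSpider(scrapy.Spider):
    # hypothetical spider name; adjust to your project
    name = "details"
    # Scrapy's downloader supports the file:// scheme, so a saved page
    # can be used directly as a start URL
    start_urls = ["file:///home/stage/Guillaume/scraper_test/mypage/details.html"]

    def parse(self, response):
        # example extraction; replace with whatever fields you actually need
        title = response.css("title::text").extract_first()
        yield {"title": title}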





















1 Answer
I have had great success using request_fingerprint to inject existing HTML files into HTTPCACHE_DIR (which is almost always .scrapy/httpcache/${spider_name}). Then turn on the built-in HTTP cache middleware, which defaults to the file-based cache storage, together with the "Dummy Policy", which treats the on-disk file as authoritative and won't make a network request if it finds the URL in the cache.
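
For reference, a minimal sketch of the settings that switch this behaviour on (these are standard Scrapy setting names; the policy and storage values shown are in fact the defaults):

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'  # resolved under the project's .scrapy data dir
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'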



I would expect the script to look something like this (it's just the general idea, and not guaranteed to even run):





import sys
from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapy.http import Request, HtmlResponse
from scrapy.settings import Settings

# this value is the actual URL from which the on-disk file was saved,
# not the "file://" version
url = sys.argv[1]
html_filename = sys.argv[2]
# read the saved page as bytes for the response body
with open(html_filename, 'rb') as fh:
    html_bytes = fh.read()
req = Request(url=url)
resp = HtmlResponse(url=req.url, body=html_bytes, encoding='utf-8', request=req)
settings = Settings()
cache = FilesystemCacheStorage(settings)
spider = None  # fill in your Spider here; its name determines the cache subdirectory
cache.store_response(spider, req, resp)
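
A hypothetical invocation, assuming the script above is saved as inject_cache.py and run from the project directory so that the cache lands where the spider will look for it (the script name and URL are made up for illustration):

python inject_cache.py "https://example.com/details" mypage/details.html

Running the spider afterwards with the cache settings enabled should then serve details.html from the cache instead of hitting the network.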





answered Nov 10 at 21:24
Matthew L Daniel