Scrape with Scrapy using saved HTML pages











I'm looking for a way to use Scrapy with HTML pages that I saved on my computer. So far, I get this error:



requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'


SPIDER_START_URLS = ["file:///home/stage/Guillaume/scraper_test/mypage/details.html"]










html web-scraping scrapy local scrapy-spider

asked Nov 9 at 10:03
Ayra


















  • 1. Unless I'm very much mistaken, Scrapy has supported the file: scheme for quite a long time. 2. According to the log you shared, it looks like something generated by requests, the famous HTTP client library, not Scrapy.
    – starrify
    Nov 9 at 10:14










  • For now I really don't know, and since I'm new to Scrapy I won't lose any time on this and will just use a static server instead.
    – Ayra
    Nov 9 at 10:36










  • Sorry for not making myself clear. I think you probably need to provide further information (more lines of the log? some related code? etc.) before others can dig further and help.
    – starrify
    Nov 9 at 11:12










  • Full log:
    Unhandled error in Deferred:
    2018-11-09 13:05:25 [twisted] CRITICAL: Traceback (most recent call last):
      File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks
        result = g.send(result)
      File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/scrapy/crawler.py", line 82, in crawl
        yield self.engine.open_spider(self.spider, start_requests)
    requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'
    – Ayra
    Nov 9 at 12:07
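
As the first comment points out, the InvalidSchema error in that log comes from the requests library, which has no connection adapter for file:// URLs, whereas Scrapy's own downloader does handle the file: scheme. A minimal sketch of a spider that reads the saved page directly (the class name, spider name and selector below are illustrative, not taken from the question):

import scrapy

class DetailsSpider(scrapy.Spider):
    # hypothetical spider name; adjust to your project
    name = "details"
    # Scrapy's downloader supports the file:// scheme, so a saved page
    # can be used directly as a start URL
    start_urls = ["file:///home/stage/Guillaume/scraper_test/mypage/details.html"]

    def parse(self, response):
        # example extraction; replace with whatever fields you actually need
        title = response.css("title::text").extract_first()
        yield {"title": title}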





















1 Answer
I have had great success using request_fingerprint to inject existing HTML files into HTTPCACHE_DIR (which is almost always .scrapy/httpcache/${spider_name}). Then turn on the built-in HTTP cache middleware, which defaults to the file-based cache storage, together with the "Dummy Policy", which treats the on-disk file as authoritative and won't make a network request if it finds the URL in the cache.
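
For reference, a minimal sketch of the settings that switch this behaviour on (these are standard Scrapy setting names; the policy and storage values shown are in fact the defaults):

# settings.py
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'  # resolved under the project's .scrapy data dir
HTTPCACHE_POLICY = 'scrapy.extensions.httpcache.DummyPolicy'
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'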



I would expect the script to look something like this (it's just the general idea, and not guaranteed to even run):





import sys
from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapy.http import Request, HtmlResponse
from scrapy.settings import Settings

# this value is the actual URL from which the on-disk file was saved,
# not the "file://" version
url = sys.argv[1]
html_filename = sys.argv[2]
# read the saved page as bytes for the response body
with open(html_filename, 'rb') as fh:
    html_bytes = fh.read()
req = Request(url=url)
resp = HtmlResponse(url=req.url, body=html_bytes, encoding='utf-8', request=req)
settings = Settings()
cache = FilesystemCacheStorage(settings)
spider = None  # fill in your Spider here; its name determines the cache subdirectory
cache.store_response(spider, req, resp)
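
A hypothetical invocation, assuming the script above is saved as inject_cache.py and run from the project directory so that the cache lands where the spider will look for it (the script name and URL are made up for illustration):

python inject_cache.py "https://example.com/details" mypage/details.html

Running the spider afterwards with the cache settings enabled should then serve details.html from the cache instead of hitting the network.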





answered Nov 10 at 21:24
Matthew L Daniel