Need Proxy?

BotProxy: Rotating Proxies Made for professionals. Really fast connection. Built-in IP rotation. Fresh IPs every day.

Find out more


Scrapy handle 302 response code

Question

I am using a simple CrawlSpider implementation to crawl websites. By default Scrapy follows 302 redirects to target locations and kind of ignores the originally requested link. On a particular site I encountered a page which 302 redirects to another page. What I aim to do is log both the original link(which responds 302) and the target location(specified in HTTP response header) and process them in parse_item method of CrawlSpider. Please guide me, how can I achieve this ?

I came across solutions mentioning to use dont_redirect=True or REDIRECT_ENABLE=False but I do not actually want to ignore the redirects, in fact I want to consider(i.e not ignore) the redirecting page as well.

eg: I visit http://www.example.com/page1 which sends a 302 redirect HTTP response and redirects to http://www.example.com/page2. By default, scrapy ignore page1, follows to page2 and processes it. I want to process both page1 and page2 in parse_item.

EDIT I am already using handle_httpstatus_list = [500, 404] in class definition of spider to handle 500 and 404 response codes in parse_item, but the same is not working for 302 if I specify it in handle_httpstatus_list.

Answer

Scrapy 1.0.5 (latest official as I write these lines) does not use handle_httpstatus_list in the built-in RedirectMiddleware -- see this issue. From Scrapy 1.1.0 (1.1.0rc1 is available), the issue is fixed.

Even if you disable redirects, you can still mimic its behavior in your callback, checking the Location header and returning a Request to the redirection

Example spider:

$ cat redirecttest.py
import scrapy


class RedirectTest(scrapy.Spider):

    name = "redirecttest"
    start_urls = [
        'http://httpbin.org/get',
        'https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip'
    ]
    handle_httpstatus_list = [302]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True, callback=self.parse_page)

    def parse_page(self, response):
        self.logger.debug("(parse_page) response: status=%d, URL=%s" % (response.status, response.url))
        if response.status in (302,) and 'Location' in response.headers:
            self.logger.debug("(parse_page) Location header: %r" % response.headers['Location'])
            yield scrapy.Request(
                response.urljoin(response.headers['Location']),
                callback=self.parse_page)

Console log:

$ scrapy runspider redirecttest.py -s REDIRECT_ENABLED=0
[scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
[scrapy] INFO: Optional features available: ssl, http11
[scrapy] INFO: Overridden settings: {'REDIRECT_ENABLED': '0'}
[scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
[scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
[scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
[scrapy] INFO: Enabled item pipelines: 
[scrapy] INFO: Spider opened
[scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
[scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/get> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/get
[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[redirecttest] DEBUG: (parse_page) response: status=302, URL=https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip
[redirecttest] DEBUG: (parse_page) Location header: 'http://httpbin.org/ip'
[scrapy] DEBUG: Crawled (200) <GET http://httpbin.org/ip> (referer: https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip)
[redirecttest] DEBUG: (parse_page) response: status=200, URL=http://httpbin.org/ip
[scrapy] INFO: Closing spider (finished)

Note that you'll need http_handlestatus_list with 302 in it, otherwise, you'll see this kind of log (coming from HttpErrorMiddleware):

[scrapy] DEBUG: Crawled (302) <GET https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip> (referer: None)
[scrapy] DEBUG: Ignoring response <302 https://httpbin.org/redirect-to?url=http%3A%2F%2Fhttpbin.org%2Fip>: HTTP status code is not handled or not allowed

cc by-sa 3.0