Scrapy crawl of multiple pages stops with "Filtered duplicate request"

Question

I just started using Scrapy. I am trying to crawl through the whole database page by page, like a simple search engine, and grab the links whose descriptions contain what I need, but I am getting the output below when the spider tries to go to the next page. I'm also not sure my way of going to the next page is correct, so I would appreciate any help with the right method.

This is my code:

import scrapy


class TestSpider(scrapy.Spider):

    name = "PLC"
    allowed_domains = ["exploit-db.com"]

    start_urls = [
        "https://www.exploit-db.com/local/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        links = response.xpath('//tr/td[5]/a/@href').extract()
        description = response.xpath('//tr/td[5]/a[@href]/text()').extract()

        # Append every link whose description mentions "PLC" to the output file
        for data, link in zip(description, links):
            if "PLC" in data:
                with open(filename, "a") as f:
                    f.write(data + '\n')
                    f.write(link + '\n\n')

        # Follow the first link found in the pagination block
        next_page = response.xpath('//div[@class="pagination"][1]//a/@href').extract()
        if next_page:
            url = response.urljoin(next_page[0])
            yield scrapy.Request(url, callback=self.parse)

But I am getting this output (an error?) on the console:

2016-06-08 16:05:21 [scrapy] INFO: Enabled item pipelines:
[]
2016-06-08 16:05:21 [scrapy] INFO: Spider opened
2016-06-08 16:05:21 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-06-08 16:05:21 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-06-08 16:05:22 [scrapy] DEBUG: Crawled (200) <GET https://www.exploit-db.com/robots.txt> (referer: None)
2016-06-08 16:05:22 [scrapy] DEBUG: Crawled (200) <GET https://www.exploit-db.com/local/> (referer: None)
2016-06-08 16:05:23 [scrapy] DEBUG: Crawled (200) <GET https://www.exploit-db.com/local/?order_by=date&order=desc&pg=2> (referer: https://www.exploit-db.com/local/)
2016-06-08 16:05:23 [scrapy] DEBUG: Crawled (200) <GET https://www.exploit-db.com/local/?order_by=date&order=desc&pg=1> (referer: https://www.exploit-db.com/local/?order_by=date&order=desc&pg=2)
2016-06-08 16:05:23 [scrapy] DEBUG: Filtered duplicate request: <GET https://www.exploit-db.com/local/?order_by=date&order=desc&pg=2> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2016-06-08 16:05:23 [scrapy] INFO: Closing spider (finished)
2016-06-08 16:05:23 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1162,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 40695,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 4,
 'dupefilter/filtered': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2016, 6, 8, 8, 5, 23, 514161),
 'log_count/DEBUG': 6,
 'log_count/INFO': 7,
 'request_depth_max': 3,
 'response_received_count': 4,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2016, 6, 8, 8, 5, 21, 561678)}
2016-06-08 16:05:23 [scrapy] INFO: Spider closed (finished)

It cannot keep crawling to the next pages, and I would love an explanation of why.

Answer

You can use the parameter dont_filter=True in your request:

if next_page:
    url = response.urljoin(next_page[0])
    yield scrapy.Request(url, callback=self.parse, dont_filter=True)

But then you will run into an infinite loop, because your XPath keeps retrieving the same links: check the pager on each page, since the link you take from .pagination is not always the "next page" link (on page 2, for example, the first link points back to page 1).

next_page = response.xpath('//div[@class="pagination"][1]//a/@href').extract()
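
A more robust option is to select only the link that actually moves the crawl forward. As a rough sketch inside parse (this assumes the pager's forward link contains the text "next"; check the real markup, which might use an arrow or a rel="next" attribute instead):

# Pick only the pager link whose text mentions "next" (assumed markup),
# instead of blindly taking the first link in the pagination block.
next_page = response.xpath('//div[@class="pagination"]//a[contains(., "next")]/@href').extract_first()
if next_page:
    # urljoin resolves a relative href against the current page's URL
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)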

Also, what if they start using Bootstrap or something similar and add extra classes such as btn btn-default to these elements? An XPath test like @class="pagination" only matches when the attribute is exactly that string.

I'd suggest using

selector.css(".pagination").xpath('.//a/@href')

instead: the CSS class selector keeps matching even if other classes are added alongside pagination.
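
Put together, the pagination handling inside parse could look like this (still a sketch; the contains(., "next") condition is the same assumption about the pager's link text as above):

# Match the pager by CSS class, then narrow down to the forward link with XPath
next_page = response.css('.pagination').xpath('.//a[contains(., "next")]/@href').extract_first()
if next_page:
    yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

With a selector that only ever yields the forward link, the crawl should stop naturally on the last page instead of being cut short by the duplicate filter, and dont_filter=True is no longer needed.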

cc by-sa 3.0