Scrapy: crawl multiple spiders sharing same items, pipeline, and settings but with separate outputs

Question

I am trying to run multiple spiders using a Python script based on the code provided in the official documentation. My Scrapy project contains multiple spiders (Spider1, Spider2, etc.) which crawl different websites and save the content of each website in a separate JSON file (output1.json, output2.json, etc.).

The items collected on the different websites share the same structure, therefore the spiders use the same item, pipeline, and setting classes. The output is generated by a custom JSON class in the pipeline.
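For reference, here is a minimal sketch of what such a shared pipeline might look like, assuming each spider exposes its output filename through a custom attribute (the attribute name output_file and the class name JsonWriterPipeline are illustrative, not the actual project code):

import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # One file per spider run; fall back to the spider name if no
        # explicit output_file attribute is set on the spider.
        filename = getattr(spider, "output_file", f"{spider.name}.json")
        self.file = open(filename, "w")

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        # Write each scraped item as one JSON object per line.
        self.file.write(json.dumps(dict(item)) + "\n")
        return item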

When I run the spiders separately they work as expected, but when I use the script below to run the spiders from the Scrapy API, the items get mixed in the pipeline. output1.json should only contain items crawled by Spider1, but it also contains the items of Spider2. How can I crawl multiple spiders with the Scrapy API using the same items, pipeline, and settings, but generating separate outputs?

Here is the code I used to run multiple spiders:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
process = CrawlerProcess(settings)
process.crawl(Spider1)
process.crawl(Spider2)
process.start()  # both spiders run concurrently in the same process

Example output1.json:

{
"Name": "Thomas",
"source": "Spider1"
}
{
"Name": "Paul",
"source": "Spider2"
}
{
"Name": "Nina",
"source": "Spider1"
}

Example output2.json:

{
"Name": "Sergio",
"source": "Spider1"
}
{
"Name": "David",
"source": "Spider1"
}
{
"Name": "James",
"source": "Spider2"
}

Normally, all the names crawled by Spider1 ("source": "Spider1") should be in output1.json, and all the names crawled by Spider2 ("source": "Spider2") should be in output2.json.

Thank you for your help!

Answer

According to the docs, to run spiders sequentially in the same process you must chain the deferreds. Because the spiders then run one after the other instead of concurrently, the pipeline only handles one spider's items at a time, so the output files no longer get mixed.

Try this:

import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from twisted.internet import defer, reactor
from web_crawler.spiders.spider1 import Spider1
from web_crawler.spiders.spider2 import Spider2

settings = get_project_settings()
runner = CrawlerRunner(settings)

@defer.inlineCallbacks
def crawl():
    # Each yield waits for the previous crawl to finish before starting the next.
    yield runner.crawl(Spider1)
    yield runner.crawl(Spider2)
    reactor.stop()

crawl()
reactor.run()  # the script blocks here until the last crawl finishes

cc by-sa 3.0