
Scrapy: How to run a spider from another Python script twice or more?

Question

Scrapy version: 1.0.5

I have searched for a long time, but most of the workarounds don't work in the current Scrapy version.

My spider is defined in jingdong_spider.py, and the interface (learned from the Scrapy documentation) to run the spider is below:

# interface (defined in jingdong_spider.py, alongside JingdongSpider)
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

def search(keyword):
    configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
    runner = CrawlerRunner()
    d = runner.crawl(JingdongSpider, keyword)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until the crawling is finished

Then in temp.py I call the search(keyword) defined above to run the spider.

Now the problem: I called search(keyword) once and it worked well. But when I called it twice, for instance,

in temp.py:

search('iphone')
search('ipad2')

it reported:

Traceback (most recent call last):
  File "C:/Users/jiahao/Desktop/code/bbt_climb_plus/temp.py", line 7, in <module>
    search('ipad2')
  File "C:\Users\jiahao\Desktop\code\bbt_climb_plus\bbt_climb_plus\spiders\jingdong_spider.py", line 194, in search
    reactor.run() # the script will block here until the crawling is finished
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1193, in run
    self.startRunning(installSignalHandlers=installSignalHandlers)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 1173, in startRunning
    ReactorBase.startRunning(self)
  File "C:\Python27\lib\site-packages\twisted\internet\base.py", line 684, in startRunning
    raise error.ReactorNotRestartable()
twisted.internet.error.ReactorNotRestartable

The first search(keyword) succeeded, but the second call failed.

Could you help?

Answer

In your code sample you are making calls to twisted.reactor, starting it on every function call. This does not work because there is only one reactor per process, and you cannot start it twice.
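
You can see the same constraint with plain Twisted; a minimal sketch, no Scrapy involved:

from twisted.internet import reactor

reactor.callWhenRunning(reactor.stop)
reactor.run()  # starts the reactor, which immediately stops itself
reactor.run()  # raises twisted.internet.error.ReactorNotRestartable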

There are two ways to solve your problem, both described in the Scrapy documentation under Common Practices. Either stick with CrawlerRunner but move reactor.run() outside your search() function to ensure it is only called once, or use CrawlerProcess and simply call crawler_process.start(). The second approach is easier; your code would look like this:

from scrapy.crawler import CrawlerProcess
from dirbot.spiders.dmoz import DmozSpider

def search(runner, keyword):
    # schedule a crawl; nothing runs until runner.start() is called
    return runner.crawl(DmozSpider, keyword)

runner = CrawlerProcess()
search(runner, "alfa")
search(runner, "beta")
runner.start()  # starts the reactor once and blocks until both crawls finish
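
If you would rather keep the first approach, remember that reactor.run() may only be called once per process, so every crawl has to be scheduled before it starts. A minimal sketch with CrawlerRunner, reusing DmozSpider from above and following the chaining pattern from the Scrapy documentation on running multiple spiders in the same process:

from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from dirbot.spiders.dmoz import DmozSpider

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # each yield waits for the previous crawl to finish before starting the next
    yield runner.crawl(DmozSpider, "alfa")
    yield runner.crawl(DmozSpider, "beta")
    reactor.stop()

crawl()
reactor.run()  # started once, stopped once, after both crawls complete

Unlike CrawlerProcess, which starts and stops the reactor for you inside start(), CrawlerRunner leaves reactor management to the caller, which is what makes it possible to queue several crawls before the single reactor.run().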

cc by-sa 3.0