I have 2 spiders and I run them like this:
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
process1 = CrawlerProcess(settings)
process1.crawl('spider1')
process1.crawl('spider2')
process1.start()
and I want these spiders to write to a common file.
This is the pipeline class:
import codecs
import json
from collections import OrderedDict

class FilePipeline(object):

    def __init__(self):
        self.file = codecs.open('data.txt', 'w', encoding='utf-8')
        self.spiders = []

    def open_spider(self, spider):
        self.spiders.append(spider.name)

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        self.file.write(line)
        return item

    def spider_closed(self, spider):
        self.spiders.remove(spider.name)
        if len(self.spiders) == 0:
            self.file.close()
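For reference, the pipeline is enabled for both spiders via ITEM_PIPELINES in settings.py in the usual way; the dotted path below is only a placeholder for the real project module:

ITEM_PIPELINES = {
    'myproject.pipelines.FilePipeline': 300,  # placeholder path, adjust to the project layout
}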
Although I don't get any error message, when all spiders are done, the common file has fewer lines (items) than the scrapy log reports; a few lines are cut off. Is there a recommended practice for writing to one file simultaneously from two spiders?
UPDATE:
Thanks, everybody! I implemented it this way:
import codecs
import json
import threading
from collections import OrderedDict
# VehicleItem is imported from the project's items module.

class FilePipeline1(object):
    # Class-level lock and file handle, shared by every pipeline instance
    # (Scrapy creates one pipeline instance per crawler/spider).
    lock = threading.Lock()
    datafile = codecs.open('myfile.txt', 'w', encoding='utf-8')

    def __init__(self):
        pass

    def open_spider(self, spider):
        pass

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        try:
            FilePipeline1.lock.acquire()
            if isinstance(item, VehicleItem):
                FilePipeline1.datafile.write(line)
        except:
            pass
        finally:
            FilePipeline1.lock.release()
        return item

    def spider_closed(self, spider):
        pass
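A slightly tidier sketch of process_item would use the lock as a context manager, so it is always released even if the write fails, instead of the manual acquire/release with a bare except; this assumes VehicleItem is the project's item class:

    def process_item(self, item, spider):
        line = json.dumps(OrderedDict(item), ensure_ascii=False, sort_keys=False) + "\n"
        # 'with' acquires the shared class-level lock and guarantees it is
        # released even if the write raises an exception.
        with FilePipeline1.lock:
            if isinstance(item, VehicleItem):
                FilePipeline1.datafile.write(line)
        return item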