I'm trying to use Scrapy with MongoDB (PyMongo) to store what my spider scrapes, but I get this error message: name must be an instance of basestring.

The spider itself works, since it scrapes the data to JSON without problems, so I guess the error is in my new pipeline. Here is the source code:

import pymongo

from scrapy import log
from scrapy.conf import settings
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):
    def __init__(self):
        self.server = settings['localhost']
        self.port = settings['27017']
        self.db = settings['IngressoRapido']
        self.col = settings['Shows']
        connection = pymongo.Connection(self.server, self.port)
        db = connection[self.db]
        self.collection = db[self.col]

    def process_item(self, item, spider):
        err_msg = ''
        for banda, local in item.items():
            if not local:
                err_msg += 'Faltando local %s da banda %s\n' % (banda, item['banda'])
        if err_msg:
            raise DropItem(err_msg)
        self.collection.insert(dict(item))
        log.msg('Item written to MongoDB database %s/%s' % (self.db, self.col),
                level=log.DEBUG, spider=spider)
        return item

2 Answers


It looks like you meant to connect to localhost on port 27017, but you are using those values as keys to look up values in the settings. Did you mean this?

    def __init__(self):
        self.server = 'localhost'
        self.port = 27017  # PyMongo expects the port as an int, not a string
        self.db = 'IngressoRapido'
        self.col = 'Shows'
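
To illustrate why the original lookups fail, here is a minimal sketch that uses a plain dict as a stand-in for the Scrapy settings object. The setting names (MONGODB_SERVER and so on) and the lookup helper are assumptions for illustration, not something taken from the question. As far as I know, Scrapy returns None for a setting name it does not recognise, so the database name ends up as None, which is consistent with the "name must be an instance of basestring" error.

```python
# Hypothetical stand-in for the Scrapy settings: a dict keyed by setting NAME.
settings = {
    'MONGODB_SERVER': 'localhost',
    'MONGODB_PORT': 27017,
    'MONGODB_DB': 'IngressoRapido',
    'MONGODB_COLLECTION': 'Shows',
}


def lookup(name):
    """Mimic a settings lookup that yields None for unknown names."""
    return settings.get(name)


# Using the VALUE as the key finds nothing, so the result is None.
wrong_db = lookup('IngressoRapido')

# Using the setting NAME as the key returns the intended value.
right_db = lookup('MONGODB_DB')

print(repr(wrong_db), repr(right_db))
```

So either hard-code the values as shown above, or keep them in settings.py and look them up by name.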
answered 2013-08-29T17:11:48.697

The following code works and cleans up its resources properly. The settings can be read with the from_crawler method.

import pymongo


class MongoPipeline(object):
    '''
    Saves the scraped item to mongodb.
    '''

    def __init__(self, mongo_server, mongo_port, mongo_db, mongo_collection):
        self.mongo_server = mongo_server
        self.mongo_port = mongo_port
        self.mongo_db = mongo_db
        self.mongo_collection = mongo_collection

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection details from the project settings.
        return cls(
            mongo_server=crawler.settings.get('MONGODB_SERVER'),
            # getint makes sure the port is an int; 27017 is MongoDB's default.
            mongo_port=crawler.settings.getint('MONGODB_PORT', 27017),
            mongo_db=crawler.settings.get('MONGODB_DB'),
            mongo_collection=crawler.settings.get('MONGODB_COLLECTION'),
        )

    def open_spider(self, spider):
        # Open one connection when the spider starts.
        self.client = pymongo.MongoClient(self.mongo_server, self.mongo_port)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        # Release the connection when the spider finishes.
        self.client.close()

    def process_item(self, item, spider):
        # insert_one replaces the deprecated Collection.insert (PyMongo 3+).
        self.db[self.mongo_collection].insert_one(dict(item))
        return item

Note: remember to import pymongo in pipelines.py.

See the official documentation for the same approach: http://doc.scrapy.org/en/latest/topics/item-pipeline.html#write-items-to-mongodb
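
For the from_crawler lookups to return anything, the settings have to be defined in the project's settings.py and the pipeline has to be enabled there. A minimal sketch, assuming the project package is called myproject (a hypothetical name) and reusing the database and collection names from the question:

```python
# settings.py (sketch; 'myproject' is a hypothetical package name)

# Enable the pipeline. The number sets the order among pipelines (lower runs first).
ITEM_PIPELINES = {
    'myproject.pipelines.MongoPipeline': 300,
}

# Connection details read by MongoPipeline.from_crawler.
MONGODB_SERVER = 'localhost'
MONGODB_PORT = 27017  # keep this an int, not the string '27017'
MONGODB_DB = 'IngressoRapido'
MONGODB_COLLECTION = 'Shows'
```

Adjust the values to match your own MongoDB instance.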

answered 2016-06-16T08:29:35.980