I have written a Scrapy scraper that exports data to AWS S3 with JsonItemExporter, using the following Spider Settings in ScrapingHub:

AWS_ACCESS_KEY_ID = AAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY = Abababababababababababababababababababab
FEED_FORMAT = json
FEED_URI = s3://scraper-dexi/my-folder/jobs-001.json

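What I need is to set the date/time dynamically in the output file name. Ideally it would use a date and time format like jobs-20171215-1000.json, but I don't know how to set a dynamic FEED_URI with Scrapinghub.

There is not much information online. The only example I could find is on the Scrapinghub website, but unfortunately it does not work.

When I apply these settings, based on the example in the documentation: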

AWS_ACCESS_KEY_ID = AAAAAAAAAAAAAAAAAAAA
AWS_SECRET_ACCESS_KEY = Abababababababababababababababababababab
FEED_FORMAT = json
FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time).json

Note the %(time) in my URI.

The scrape fails with the following errors:

[scrapy.utils.signal] Error caught on signal handler: <bound method ?.open_spider of <scrapy.extensions.feedexport.FeedExporter object at 0x7fd11625d410>>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 190, in open_spider
    uri = self.urifmt % self._get_uri_params(spider)
ValueError: unsupported format character 'j' (0x6a) at index 53

[scrapy.utils.signal] Error caught on signal handler: <bound method ?.item_scraped of <scrapy.extensions.feedexport.FeedExporter object at 0x7fd11625d410>>
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/site-packages/twisted/internet/defer.py", line 150, in maybeDeferred
    result = f(*args, **kw)
  File "/usr/local/lib/python2.7/site-packages/pydispatch/robustapply.py", line 55, in robustApply
    return receiver(*arguments, **named)
  File "/usr/local/lib/python2.7/site-packages/scrapy/extensions/feedexport.py", line 220, in item_scraped
    slot = self.slot
AttributeError: 'FeedExporter' object has no attribute 'slot'

1 Answer

I misunderstood the significance of the s in the documentation and did not realize that it is part of the token itself.

I changed

FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time).json

to

FEED_URI = s3://scraper-dexi/my-folder/jobs-%(time)s.json

as shown in the documentation, and that solved the problem. In other words,

%(time)

becomes

%(time)s
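
For anyone wondering why the missing s matters: the traceback shows Scrapy filling in the URI with Python's % operator (uri = self.urifmt % self._get_uri_params(spider)), so %(time) needs a conversion character such as s after it. Without one, Python reads .j from ".json" as a precision plus an unknown conversion type, hence "unsupported format character 'j'". Below is a rough sketch of that behaviour in plain Python. The params dict and the render and custom_feed_uri helpers are hypothetical names of mine, not Scrapy API, and the timestamp value is only a made-up stand-in for whatever Scrapy actually passes.

```python
from datetime import datetime

# Hypothetical stand-in for the values Scrapy passes to the URI template.
params = {"time": "2017-12-15T10-00-00"}

broken = "s3://scraper-dexi/my-folder/jobs-%(time).json"
fixed = "s3://scraper-dexi/my-folder/jobs-%(time)s.json"


def render(template, values):
    """Apply printf-style formatting, as the traceback line does."""
    return template % values


# Without the trailing "s", Python parses ".j" as precision + conversion type.
try:
    render(broken, params)
    error = None
except ValueError as exc:
    error = str(exc)

# With the "s", the token is a complete string conversion.
rendered = render(fixed, params)


def custom_feed_uri(now):
    """Assumption: build the URI yourself when a custom format is wanted."""
    stamp = now.strftime("%Y%m%d-%H%M")
    return "s3://scraper-dexi/my-folder/jobs-%s.json" % stamp


print(error)
print(rendered)
print(custom_feed_uri(datetime(2017, 12, 15, 10, 0)))
```

The last helper is only one possible way to get the jobs-20171215-1000.json style name from the question. It assumes the URI is computed in Python code (for example in the project's settings.py) rather than typed into the Scrapinghub settings UI, which I have not verified for this setup.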

Answered 2017-12-17T23:08:25.203