python - Live scraper | Complex problem

Question

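I have a working web scraper; the codes it searches for are kept in a separate Excel document. I'm using ScrapingHub's API because it can be accessed from anywhere and is very convenient. I want to write code that updates itself and scrapes based on whatever is listed in the Excel sheet.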

Using my Excel list, how can I make my code update automatically (i.e., when I add MSFT to my Excel sheet, my code is updated to include MSFT)?

Also, is there a way to make it deploy automatically?

--==Spider Code==-- **the codes (search terms) are appended to each link

import collections

from scrapy.spiders import XMLFeedSpider


class Spider(XMLFeedSpider):
    name = "NewsScraper"
    allowed_domains = ["yahoo.com"]
    # One RSS feed URL per group of codes (comma-separated after ?s=).
    start_urls = (
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=ABIO,ACFN,AEMD,AEZS,AITB',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=BGMD,BIOA',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=CANF,CBIO,CCCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=DRIO,DRWI,DXTR,ENCR',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=GNMX,GNUS,GPL,HIPP,HSGX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=MBOT,MBVX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=NBY,NNVC,NTRP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=PGRX,PLXP',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=SANW,SBOT,SCON,SCYX',
        'https://feeds.finance.yahoo.com/rss/2.0/headline?s=UNXL,UQM,URRE',
    )
    # parse_node() is called once for every <item> element in a feed.
    itertag = 'item'

    def parse_node(self, response, node):
        # Build the record in a fixed field order.
        item = collections.OrderedDict()
        item['Title'] = node.xpath('title/text()').extract_first()
        item['PublishDate'] = node.xpath('pubDate/text()').extract_first()
        item['Description'] = node.xpath('description/text()').extract_first()
        item['Link'] = node.xpath('link/text()').extract_first()
        return item
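
For the auto-update part of the question, here is a minimal sketch of one possible approach, not a tested solution. It assumes the Excel sheet is exported as a CSV file with one code per row in the first column and no header row. It also assumes a fixed batch size per feed URL, whereas the spider above groups the codes by hand. The names `read_tickers`, `build_feed_urls`, `FEED_BASE`, and `batch_size` are made up for illustration.

```python
import csv
import io

# Base URL taken from the start_urls in the spider above.
FEED_BASE = "https://feeds.finance.yahoo.com/rss/2.0/headline"


def read_tickers(csv_file):
    """Return unique, upper-cased codes from the first CSV column, in file order."""
    tickers = []
    for row in csv.reader(csv_file):
        if not row:
            continue  # skip blank lines
        ticker = row[0].strip().upper()
        if ticker and ticker not in tickers:
            tickers.append(ticker)
    return tickers


def build_feed_urls(tickers, batch_size=5):
    """Group the codes into batches and build one feed URL per batch."""
    urls = []
    for start in range(0, len(tickers), batch_size):
        batch = tickers[start:start + batch_size]
        urls.append(f"{FEED_BASE}?s={','.join(batch)}")
    return urls


# Stand-in for the exported sheet; a real run would open the CSV file instead.
sample_sheet = io.StringIO("ABIO\nACFN\nMSFT\n")
for url in build_feed_urls(read_tickers(sample_sheet), batch_size=2):
    print(url)
```

With something like this, the hard-coded `start_urls` tuple could in principle be replaced by URLs generated each time the spider starts, so adding MSFT to the sheet would be picked up on the next run. Where the exported file lives, and how the spider reads it on ScrapingHub, is left open here.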
