python - Trying to run a spider from another location within a script

Question

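All,

I am trying to fully automate my scraping, which consists of 3 steps:

1- Get the list of index pages for the ads (non-Scrapy work, for various reasons)
2- Get the list of ad URLs from the index pages obtained in the first step (Scrapy work)

My Scrapy project is in the usual directory:

C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders\GetAdUrls_spider.py (the spider name in the "GetAdUrls_spider" file is name = "getadurls")

My script that automates steps 1 and 2 is in this directory:

C:\Website_DATA\SCRIPTS\StepByStepLauncher.py

Following the Scrapy documentation, I tried to import the spider and run it from inside the script with the following code: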

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls

spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here

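Unfortunately, when I try to run this script I keep getting the error "No module named GetAdUrlsFromIndex.spiders.GetAdUrls_spider". I tried changing the working directory to several different locations and played around with the names, but nothing seems to work.

Any help would be much appreciated. Thanks!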

score -1 · Accepted Answer

If you do have an __init__.py in both C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex and C:\Python27\Scripts\GetAdUrlsFromIndex_project\GetAdUrlsFromIndex\spiders, try modifying your script this way:

import sys
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log

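# Add the project root (the directory that contains the GetAdUrlsFromIndex
# package) to the import search path so the import below can be resolved.
sys.path.append('C:/Python27/Scripts/GetAdUrlsFromIndex_project')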
from GetAdUrlsFromIndex.spiders.GetAdUrls_spider import getadurls

spider = getadurls(domain='website.com')
crawler = Crawler(Settings())
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here
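
To see why the sys.path change matters without needing Scrapy at all, here is a minimal sketch that rebuilds a similar layout in a temporary directory. The names demo_project_pkg, demo_spider, missing_init_files and import_from_project are made up for illustration and are not part of the question's project or of Scrapy. It is written for Python 3; the question uses Python 2.7, where a missing __init__.py makes the import fail outright, while Python 3 may still treat such a directory as a namespace package.

```python
import importlib
import os
import sys
import tempfile


def missing_init_files(project_root, dotted_module):
    """Return package directories on the way to dotted_module that lack __init__.py."""
    missing = []
    current = project_root
    # Every component except the last one is a package directory.
    for part in dotted_module.split(".")[:-1]:
        current = os.path.join(current, part)
        if not os.path.isfile(os.path.join(current, "__init__.py")):
            missing.append(current)
    return missing


def import_from_project(project_root, dotted_module):
    """Put project_root on sys.path if it is not there yet, then import the module."""
    if project_root not in sys.path:
        sys.path.append(project_root)
    importlib.invalidate_caches()
    return importlib.import_module(dotted_module)


# Build a throwaway project that mirrors the layout from the question
# (hypothetical names, so nothing real is shadowed).
project_root = tempfile.mkdtemp()
spiders_dir = os.path.join(project_root, "demo_project_pkg", "spiders")
os.makedirs(spiders_dir)
for package_dir in (os.path.dirname(spiders_dir), spiders_dir):
    open(os.path.join(package_dir, "__init__.py"), "w").close()
with open(os.path.join(spiders_dir, "demo_spider.py"), "w") as handle:
    handle.write("name = 'getadurls'\n")

module_name = "demo_project_pkg.spiders.demo_spider"

# Before the project root is on sys.path, the import cannot be resolved.
try:
    importlib.import_module(module_name)
    failed_before = False
except ImportError:
    failed_before = True

# After appending the project root, the same dotted name imports fine.
module = import_from_project(project_root, module_name)
print(failed_before, missing_init_files(project_root, module_name), module.name)
```

Changing the working directory only helps when the script's own directory or the current directory happens to be the project root, which is why appending the project root explicitly is the more predictable fix.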
