您可以从爬虫中发送 json 数据并获取结果。可以按如下方式完成:
拥有蜘蛛:
class MySpider(scrapy.Spider):
# some attributes
accomulated=[]
def parse(self, response):
# do your logic here
page_text = response.xpath('//text()').extract()
for text in page_text:
if conditionsAreOk( text ):
self.accomulated.append(text)
def closed( self, reason ):
# call when the crawler process ends
print JSON.dumps(self.accomulated)
编写一个 runner.py 脚本,如:
import sys
from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.utils.project import get_project_settings
from spiders import MySpider
def main(argv):
url = argv[0]
configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s', 'LOG_ENABLED':False })
runner = CrawlerRunner( get_project_settings() )
d = runner.crawl( MySpider, url=url)
# For Multiple in the same process
#
# runner.crawl('craw')
# runner.crawl('craw2')
# d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run() # the script will block here until the crawling is finished
if __name__ == "__main__":
main(sys.argv[1:])
然后从你的 main.py 调用它:
import json, subprocess, sys, time
def main(argv):
# urlArray has http:// or https:// like urls
for url in urlArray:
p = subprocess.Popen(['python', 'runner.py', url ], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = p.communicate()
# do something with your data
print out
print json.loads(out)
# This just helps to watch logs
time.sleep(0.5)
if __name__ == "__main__":
main(sys.argv[1:])
笔记
如您所知,这不是使用 Scrapy 的最佳方式,但对于不需要复杂后期处理的快速结果,此解决方案可以满足您的需求。
我希望它有所帮助。