我是 Scrapy 的新手,也是真正的 python 新手。我正在尝试编写一个抓取工具,它将提取文章标题、链接和文章描述,几乎就像网页中的 RSS 提要一样,以帮助我完成论文。我编写了以下刮板,当我运行它并将其导出为 .txt 时,它会返回空白。我相信我需要添加一个项目加载器,但我并不积极。
项目.py
from scrapy.item import Item, Field
class NorthAfricaItem(Item):
title = Field()
link = Field()
desc = Field()
pass
蜘蛛
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafricatutorial.items import NorthAfricaItem
class NorthAfricaItem(BaseSpider):
name = "northafrica"
allowed_domains = ["http://www.north-africa.com/"]
start_urls = [
"http://www.north-africa.com/naj_news/news_na/index.1.html",
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul/li')
items = []
for site in sites:
item = NorthAfricaItem()
item['title'] = site.select('a/text()').extract()
item['link'] = site.select('a/@href').extract()
item['desc'] = site.select('text()').extract()
items.append(item)
return items
更新
感谢 Talvalin 的帮助,另外还有一些混乱,我能够解决问题。我正在使用我在网上找到的股票脚本。然而,一旦我使用了 shell,我就能够找到正确的标签来获得我需要的东西。我最终得到:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from northafrica.items import NorthAfricaItem
class NorthAfricaSpider(BaseSpider):
name = "northafrica"
allowed_domains = ["http://www.north-africa.com/"]
start_urls = [
"http://www.north-africa.com/naj_news/news_na/index.1.html",
]
def parse(self, response):
hxs = HtmlXPathSelector(response)
sites = hxs.select('//ul/li')
items = []
for site in sites:
item = NorthAfricaItem()
item['title'] = site.select('//div[@class="short_holder"] /h2/a/text()').extract()
item['link'] = site.select('//div[@class="short_holder"]/h2/a/@href').extract()
item['desc'] = site.select('//span[@class="summary"]/text()').extract()
items.append(item)
return items
如果有人在这里看到任何我做错了的东西,请告诉我......但它有效。