python - Scrapy project, scraping a schedule

Question

I'm trying to scrape the schedule on this page: http://stats.swehockey.se/ScheduleAndResults/Schedule/3940

using this code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["http://stats.swehockey.se/"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tbody/tr')

        for row in rows:
            date = row.select('/td[1]/div/span/text()').extract()
            teams = row.select('/td[2]/text()').extract()

            print date, teams

But I can't get it to work. What am I doing wrong? I've been trying to figure it out myself for a few hours now, but I have no idea why my XPath doesn't work.

score 1 · Accepted Answer

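There are two problems:

tbody is a tag that modern browsers insert when they build the DOM. It is not present in the raw HTML that Scrapy downloads, so an XPath that goes through tbody matches nothing.
The XPaths for the date and the teams are incorrect: you should use relative XPaths (.//) so the expression is evaluated against the current row, and the td indexes are also wrong; they should be 2 and 3 instead of 1 and 2.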
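Both points can be checked offline with a minimal sketch. It assumes a made-up HTML fragment (the real page's markup may differ) and uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors. ElementTree supports only a subset of XPath, so the text is read with findtext() rather than text(); the names RAW_HTML and schedule are illustrative only.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment loosely modelled on the schedule table.
# As in the raw page source, there is no <tbody> element.
RAW_HTML = """
<html><body>
<table class="tblContent">
  <tr><td>1</td><td><div><span>2013-09-14</span></div></td><td>Team A - Team B</td></tr>
  <tr><td>2</td><td><div><span>2013-09-15</span></div></td><td>Team C - Team D</td></tr>
</table>
</body></html>
""".strip()

root = ET.fromstring(RAW_HTML)

# Problem 1: a path that goes through tbody matches nothing in the raw markup.
with_tbody = root.findall(".//table[@class='tblContent']/tbody/tr")

# Dropping tbody finds the rows.
rows = root.findall(".//table[@class='tblContent']/tr")

# Problem 2: evaluate paths relative to each row, using columns 2 and 3.
schedule = []
for row in rows:
    date = row.findtext("./td[2]/div/span")
    teams = row.findtext("./td[3]")
    schedule.append((date, teams))

print(len(with_tbody))
print(schedule)
```

In this sketch the tbody path returns an empty list, while the paths relative to each row return one (date, teams) pair per row.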

Here is the whole code with some modifications (working):

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    # allowed_domains takes bare domain names, not URLs
    allowed_domains = ["stats.swehockey.se"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # No tbody here: it is not in the HTML that Scrapy receives
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            # Relative XPaths (.//) are evaluated against the current row
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            yield item

Hope that helps.
