I am trying to scrape this site: http://stats.swehockey.se/ScheduleAndResults/Schedule/3940

So far (thanks to alecxe) I can retrieve the dates and the teams.

from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["stats.swehockey.se"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            yield item

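My next step is to filter out everything that is not a home game for "AIK" or "Djurgårdens IF". After that, I need to reformat the result into an .ics file that can be added to Google Calendar.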

EDIT: I have solved a few things, but there is still a lot left to do. My code now looks like this:

# -*- coding: UTF-8 -*-
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class SchemaItem(Item):
    date = Field()
    teams = Field()


class SchemaSpider(BaseSpider):
    name = "schema"
    allowed_domains = ["stats.swehockey.se"]
    start_urls = [
        "http://stats.swehockey.se/ScheduleAndResults/Schedule/3940"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            for string in item['teams']:

                teams = string.split('-') #split it

                home_team = teams[0]#.split(' ') #only the first name, e.g. just 'Djurgårdens' out of 'Djurgårdens IF'
                away_team = teams[1]
                #home_team[0] = home_team[0].replace(" ", "") #remove whitespace
                #home_team = home_team[0]

                if "AIK" in home_team:
                    for string in item['date']:
                            year = string[0:4]
                            month = string[5:7]
                            day = string[8:10]
                            hour = string[11:13]
                            minute = string[14:16]

                            print year, month, day, hour, minute, home_team, away_team  
                elif u"Djurgårdens" in home_team:
                    for string in item['date']:
                        year = string[0:4]
                        month = string[5:7]
                        day = string[8:10]
                        hour = string[11:13]
                        minute = string[14:16]

                        print year, month, day, hour, minute, home_team, away_team     

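This code prints the games for "AIK", "Djurgårdens IF" and "Skellefteå AIK". So my question is how to filter out the "Skellefteå AIK" games, and whether there is a simple way to make this program better. Any thoughts?

Best regards!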

2 Answers

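A few things to note:

  1. string is the name of a standard-library module (and reads like the built-in str type), so it is generally best to avoid using it for your own variables.
  2. Stripping whitespace is indeed the way to clean up home_team enough for a direct comparison with the desired "AIK". I used string.strip() for home_team and away_team because it is a little cleaner than string.replace(" ", ""), but that is a matter of personal taste.
  3. I also added a ":" between the home and away teams in the print line so I could tell them apart more easily while testing; feel free to drop that change.

Check it out, and let me know if you have any other questions. :)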
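To make point 2 concrete, here is a small sketch of how the two calls differ. It is written for Python 3, and the sample fixture string (including its spacing) is an assumption about what the page returns, not something taken from the site.

```python
# Sample fixture text; the exact spacing on the real page is an assumption.
fixture = " Djurgårdens IF - Skellefteå AIK "
home_raw, away_raw = fixture.split("-")

# strip() trims only leading and trailing whitespace, so the name stays intact.
home_stripped = home_raw.strip()

# replace(" ", "") removes every space, including the one inside the name.
home_replaced = home_raw.replace(" ", "")

print(repr(home_stripped))
print(repr(home_replaced))

# An exact comparison after strip() tells "AIK" apart from "Skellefteå AIK".
print(away_raw.strip() == "AIK")
```

So the two calls are interchangeable for a single-word name such as "AIK", but only strip() keeps "Djurgårdens IF" in a form that can be compared against the full team name.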

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@class="tblContent"]/tr')

        for row in rows:
            item = SchemaItem()
            item['date'] = row.select('.//td[2]/div/span/text()').extract()
            item['teams'] = row.select('.//td[3]/text()').extract()

            for fixture in item['teams']:
                teams = fixture.split('-') #split it
                home_team = teams[0].strip()
                away_team = teams[1].strip()

                if home_team == "AIK":
                    for fixDate in item['date']:
                        year = fixDate[0:4]
                        month = fixDate[5:7]
                        day = fixDate[8:10]
                        hour = fixDate[11:13]
                        minute = fixDate[14:16]
                        print year, month, day, hour, minute, home_team, ":", away_team
                elif home_team == u"Djurgårdens IF":
                    for fixDate in item['date']:
                        year = fixDate[0:4]
                        month = fixDate[5:7]
                        day = fixDate[8:10]
                        hour = fixDate[11:13]
                        minute = fixDate[14:16]
                        print year, month, day, hour, minute, home_team, ":", away_team
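The two branches above do the same work, so one possible tidy-up is to move the filtering and date handling into a single helper. The following is only a sketch in Python 3: parse_fixture and HOME_TEAMS are hypothetical names, and the "YYYY-MM-DD HH:MM" date layout is inferred from the slice offsets used above rather than confirmed against the page.

```python
from datetime import datetime

# Home teams to keep; exact matching excludes "Skellefteå AIK".
HOME_TEAMS = {"AIK", "Djurgårdens IF"}


def parse_fixture(date_text, teams_text, home_teams=HOME_TEAMS):
    """Return (kickoff, home, away) for a wanted home game, else None.

    Assumes date_text looks like "2013-09-14 19:00" and teams_text like
    "AIK - Skellefteå AIK"; both formats are inferred, not confirmed.
    """
    home, sep, away = teams_text.partition("-")
    if not sep:
        return None  # header or spacer row without a "home - away" pair
    home, away = home.strip(), away.strip()
    if home not in home_teams:
        return None
    try:
        kickoff = datetime.strptime(date_text.strip()[:16], "%Y-%m-%d %H:%M")
    except ValueError:
        return None  # date cell empty or in an unexpected format
    return kickoff, home, away
```

Inside parse, each extracted date/teams pair could be passed to this helper and skipped whenever it returns None. Getting a datetime back instead of separate string slices should also make the later .ics step easier, since the value can be formatted with strftime.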
answered 2013-09-14T19:25:51.067

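I am just guessing that a home game is one where the team you are looking for is listed first (before the dash).

You can do this either in XPath or from Python. If you want to do it in XPath, select only the rows that contain the home team's name.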

//table[@class="tblContent"]/tr[
    contains(substring-before(.//td[3]/text(), "-"), "AIK")
  or
    contains(substring-before(.//td[3]/text(), "-"), "Djurgårdens IF")
]

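You can safely remove all the whitespace (including newlines); I only added it for readability.

In Python you should be able to do much the same thing; using regular expressions would probably be more concise.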
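As a rough illustration of the regular-expression route, the sketch below anchors the pattern so that the whole home side has to be one of the wanted names. A plain substring test would also accept "Skellefteå AIK", which the question wants to exclude. It is written for Python 3; HOME_GAME and match_home_game are hypothetical names, and the "home - away" layout of the cell text is an assumption.

```python
import re

# Anchored pattern: the entire home side must be one of the wanted names.
HOME_GAME = re.compile(r"^\s*(AIK|Djurgårdens IF)\s*-\s*(.+?)\s*$")


def match_home_game(teams_text):
    """Return (home, away) if teams_text is a wanted home game, else None."""
    m = HOME_GAME.match(teams_text)
    return (m.group(1), m.group(2)) if m else None


for text in ["AIK - Skellefteå AIK", "Skellefteå AIK - AIK"]:
    print(text, "->", match_home_game(text))
```

Because the alternation sits directly after the start anchor, a row whose home side merely ends in "AIK" does not match.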

answered 2013-09-13T22:11:01.880