python - Formatting data scraped from a website (BeautifulSoup)

Question

I'm building a scraper with BeautifulSoup and Requests that scrapes a web page for a match schedule (and the results, where available). This is what I have so far:

    def getMatches(self):
        # Change seriesCode in the URL for a different series.
        url = 'http://icc-cricket.yahoo.net/match_zone/series/fixtures.php?seriesCode=ENG_WI_2012'
        page = requests.get(url)
        page_content = page.content
        soup = BeautifulSoup(page_content)

        # The fixtures table sits inside the div with class "bElementBox".
        result = soup.find('div', attrs={'class':'bElementBox'})
        tags = result.findChildren('tr')

        # Print the flattened text of every table row.
        for elem in tags:
            x = elem.getText()
            print(x)

These are the results I get:

Date &amp; Time (GMT)fixture
Thu, May 17, 2012 10:00 AMEngland&nbsp; vs &nbsp;West Indies
3rd&nbsp;TESTA full scorecard will be available shortly.Venue: Edgbaston,    BirminghamResult: England won by 5 wickets
Fri, May 25, 2012 11:00 AMEngland&nbsp; vs &nbsp;West Indies
2nd&nbsp;TESTClick here for the full scorecardVenue: Trent Bridge, NottinghamResult:     England won by 9 wickets
Thu, Jun 7, 2012 10:00 AMEngland&nbsp; vs &nbsp;West Indies
1st&nbsp;TESTClick here for the full scorecardVenue: Lord'sResult: Match Drawn
Sat, Jun 16, 2012 9:45 AMEngland&nbsp; vs &nbsp;West Indies
1st&nbsp;ODIClick here for the full scorecardVenue: The Rose Bowl, SouthamptonResult:     England won by 114 runs (D/L Method)
Tue, Jun 19, 2012 9:45 AMEngland&nbsp; vs &nbsp;West Indies
2nd&nbsp;ODIVenue: KIA Oval
Fri, Jun 22, 2012 9:45 AMEngland&nbsp; vs &nbsp;West Indies
3rd&nbsp;ODIVenue: Headingley Carnegie
Sun, Jun 24, 2012 12:00 AMEngland&nbsp; vs &nbsp;West Indies
1st&nbsp;T20Venue: Trent Bridge, Nottingham

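Now I want to sort this data into some structured format. A list of dictionaries, each containing the information for a single match, would be ideal, but I'm stuck on how to get there. The output strings contain entities such as &nbsp;, and the time runs straight into the team name, as in AMEngland. There is another problem: if I split the strings on the space character, a country like West Indies, whose name is two words, gets split as well, so there is no uniform way to parse it.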

So is there a way to parse this data uniformly, so that I can get it into a form something like this:

[ {'date': match_date, 'home_team': team1, 'away_team': team2, 'venue': venue},{ same for match 2}, { match 3 }...]

I'd appreciate any help. :)

score 1 · Accepted Answer

Separating the date/time from the countries isn't hard. You can do the same for "Venue" and "Result".

>>> import re
>>> s = "Sun, Jun 24, 2012 12:00 AMEngland&nbsp; vs &nbsp;West Indies"
>>> match = re.search(r"\b[AP]M", s)
>>> s[0:match.end()]
'Sun, Jun 24, 2012 12:00 AM'
>>> s[match.end():]
'England&nbsp; vs &nbsp;West Indies'
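
Extending the same idea to whole rows, here is a minimal sketch that turns the printed lines into a list of dictionaries. It assumes each match spans two consecutive rows (fixture, then details), as in the output shown in the question. The names `clean`, `parse_matches`, `FIXTURE_RE` and `DETAILS_RE`, and the extra `result` key, are illustrative and not part of any library.

```python
import re

# Sample rows copied from the output in the question.
ROWS = [
    "Thu, Jun 7, 2012 10:00 AMEngland&nbsp; vs &nbsp;West Indies",
    "1st&nbsp;TESTClick here for the full scorecardVenue: Lord'sResult: Match Drawn",
    "Sun, Jun 24, 2012 12:00 AMEngland&nbsp; vs &nbsp;West Indies",
    "1st&nbsp;T20Venue: Trent Bridge, Nottingham",
]

# Date/time ends at the first AM/PM; the teams are separated by "vs".
FIXTURE_RE = re.compile(r"^(?P<date>.*?\b[AP]M)(?P<home>.+?)\s+vs\s+(?P<away>.+)$")
# "Venue:" is always present on a details row; "Result:" is optional.
DETAILS_RE = re.compile(r"Venue:\s*(?P<venue>.*?)\s*(?:Result:\s*(?P<result>.*))?$")


def clean(text):
    """Replace '&nbsp;' entities and non-breaking spaces, then collapse whitespace."""
    text = text.replace("&nbsp;", " ").replace("\xa0", " ")
    return " ".join(text.split())


def parse_matches(rows):
    """Pair each fixture row with the details row that follows it."""
    matches = []
    current = None
    for raw in rows:
        line = clean(raw)
        fixture = FIXTURE_RE.match(line)
        if fixture:
            current = {
                "date": fixture.group("date"),
                "home_team": fixture.group("home"),
                "away_team": fixture.group("away"),
                "venue": None,
                "result": None,
            }
            matches.append(current)
            continue
        details = DETAILS_RE.search(line)
        if details and current is not None:
            current["venue"] = details.group("venue")
            current["result"] = details.group("result")  # None if not played yet
    return matches


for match in parse_matches(ROWS):
    print(match)
```

Splitting on the literal " vs " instead of on single spaces is what keeps two-word names such as West Indies intact. Rows that match neither pattern (the table header, for instance) are simply skipped.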

score 0 · Accepted Answer

Take a look at Scrapy; it will make this task much easier.

You define the items you want to scrape from the site:

from scrapy.item import Item, Field

class CricketMatch(Item):
    date = Field()
    home_team = Field()
    away_team = Field()
    venue = Field()

Then define a loader with XPath expressions to populate these items. After that you can use the items directly, or generate JSON output or similar.
