I'm building a scraper with BeautifulSoup and Requests that scrapes a web page for a match schedule (and results, if available). This is what I have so far:
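import requests
from bs4 import BeautifulSoup  # assuming the bs4 package

def getMatches(self):
    # Change seriesCode in the URL for a different series.
    url = 'http://icc-cricket.yahoo.net/match_zone/series/fixtures.php?seriesCode=ENG_WI_2012'
    page = requests.get(url)
    page_content = page.content
    soup = BeautifulSoup(page_content)
    # The fixtures table sits inside the div with class "bElementBox".
    result = soup.find('div', attrs={'class': 'bElementBox'})
    tags = result.findChildren('tr')
    # Print the flattened text of every table row.
    for elem in tags:
        x = elem.getText()
        print(x)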
These are the results I get:
Date & Time (GMT)fixture
Thu, May 17, 2012 10:00 AMEngland vs West Indies
3rd TESTA full scorecard will be available shortly.Venue: Edgbaston, BirminghamResult: England won by 5 wickets
Fri, May 25, 2012 11:00 AMEngland vs West Indies
2nd TESTClick here for the full scorecardVenue: Trent Bridge, NottinghamResult: England won by 9 wickets
Thu, Jun 7, 2012 10:00 AMEngland vs West Indies
1st TESTClick here for the full scorecardVenue: Lord'sResult: Match Drawn
Sat, Jun 16, 2012 9:45 AMEngland vs West Indies
1st ODIClick here for the full scorecardVenue: The Rose Bowl, SouthamptonResult: England won by 114 runs (D/L Method)
Tue, Jun 19, 2012 9:45 AMEngland vs West Indies
2nd ODIVenue: KIA Oval
Fri, Jun 22, 2012 9:45 AMEngland vs West Indies
3rd ODIVenue: Headingley Carnegie
Sun, Jun 24, 2012 12:00 AMEngland vs West Indies
1st T20Venue: Trent Bridge, Nottingham
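Now I would like to organize the data in some structured format. A list of dictionaries, each containing the information for a single match, would be ideal, but I'm stuck on how to achieve that. The output strings contain stray characters, and the time runs straight into the team name, as in AMEngland. There is also the problem that if I split the string on spaces, teams such as West Indies, whose names contain more than one word, get split up, so there is no uniform way to parse it.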
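To make the splitting problem concrete, here is a minimal sketch that uses one line copied from the output above. Splitting on the literal " vs " is only something I tried for illustration, not a solution: it keeps the second team name whole but still leaves the date fused to the first team name.

```python
# One row of the printed output, copied verbatim from above.
line = "Thu, May 17, 2012 10:00 AMEngland vs West Indies"

# Splitting on whitespace breaks the team name "West Indies" in two.
tokens = line.split()
print(tokens[-2:])

# Splitting on the literal " vs " keeps the second team name whole,
# but the date and time are still fused to the first team name.
left, second_team = line.split(" vs ")
print(left)
print(second_team)
```

Here left still ends in AMEngland, so the date, time and first team cannot be separated this way.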
So is there a way to parse this data uniformly, so that I can get it into a structured form? Something like:
[ {'date': match_date, 'home_team': team1, 'away_team': team2, 'venue': venue},{ same for match 2}, { match 3 }...]
I'd appreciate any help. :)