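I am scraping a website using Python 2.7 and Beautiful Soup 3.2. I am new to both, but the documentation has helped me get started.
I have been reading the following documentation: http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html#contents and http://thepcspy.com/read/scraping-websites-with-python/
Here is what I have so far, including the part that fails: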
# Import the classes that are needed
import urllib2
from BeautifulSoup import BeautifulSoup
# URL to scrape and open it with the urllib2
url = 'http://www.wiziwig.tv/competition.php?competitionid=92&part=sports&discipline=football'
source = urllib2.urlopen(url)
# Turn the saved source into a BeautifulSoup object
soup = BeautifulSoup(source)
# From the source HTML page, search and store all <td class="home">..</td> and its content
hometeamsTd = soup.findAll('td', { "class" : "home" })
# Loop through the tag and store only the needed information, being the home team
hometeams = [tag.contents[1] for tag in hometeamsTd]
# From the source HTML page, search and store all <td class="away">..</td> and its content
awayteamsTd = soup.findAll('td', { "class" : "away" })
# Loop through the tag and store only the needed information, being the away team
awayteams = [tag.contents[1] for tag in awayteamsTd]
The tag.contents of each tag in hometeamsTd looks like this:
[
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Harkemase Boys', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6077" />],
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'RKC Waalwijk', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-427" />],
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'PSV', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-3" />],
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Ajax', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-2" />],
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />],
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'SC Heerenveen', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-14" />],
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Feyenoord', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-9" />],
[<img class="flag" src="/gfx/flags/nl.gif" alt="nl" />, u'Dutch KNVB Beker', <img src="/gfx/favourite_off.gif" alt="fav icon" class="fav off" id="team-6758" />]
]
The tag.contents of each tag in awayteamsTd looks like this:
[
[u'Away-team'],
[<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-13" />, u'NEC', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
[<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-11" />, u'Heracles', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
[<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-428" />, u'Stormvogels Telstar', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
[<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-419" />, u'FC Volendam', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
[<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-7" />, u'FC Twente', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />],
[<img src="/gfx/favourite_off.gif" class="fav off" alt="fav icon" id="team-415" />, u'FC Dordrecht', <img class="flag" src="/gfx/flags/nl.gif" alt="nl" />]
]
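There are two things I am trying to solve but have not figured out yet:
- The line awayteams = [tag.contents[1] for tag in awayteamsTd] raises IndexError: list index out of range. The error itself makes sense: as the tag.contents output for awayteamsTd shows, the first entry is [u'Away-team'], which has only one element, so index 1 does not exist. How can I remove or skip that entry?
- The home team output works without errors, but I want to exclude the entries where the text is Dutch KNVB Beker.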
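
To make the two points above concrete, here is a minimal sketch of one possible approach, not a confirmed fix. It runs on plain lists that stand in for tag.contents, so it needs neither the site nor BeautifulSoup. The '<img>' placeholder strings, the helper name team_names and the constant EXCLUDED are assumptions made for illustration only.

```python
# Stand-in data: each inner list mimics tag.contents for one <td>.
# The '<img>' strings are placeholders for the real Tag objects (assumption).
home_contents = [
    ['<img>', u'Harkemase Boys', '<img>'],
    ['<img>', u'RKC Waalwijk', '<img>'],
    ['<img>', u'Dutch KNVB Beker', '<img>'],
    ['<img>', u'PSV', '<img>'],
]
away_contents = [
    [u'Away-team'],  # header cell: only one element, so index 1 does not exist
    ['<img>', u'NEC', '<img>'],
    ['<img>', u'Heracles', '<img>'],
]

EXCLUDED = u'Dutch KNVB Beker'  # hypothetical constant: label to drop


def team_names(contents_list, excluded=None):
    """Return the second item of each entry, skipping short or excluded ones."""
    names = []
    for contents in contents_list:
        # Skip cells such as [u'Away-team'] that have no element at index 1.
        if len(contents) < 2:
            continue
        name = contents[1].strip()
        # Skip entries whose text matches the label to exclude.
        if name == excluded:
            continue
        names.append(name)
    return names


hometeams = team_names(home_contents, excluded=EXCLUDED)
awayteams = team_names(away_contents)
```

With the real page, the same helper could presumably be fed [tag.contents for tag in hometeamsTd] and [tag.contents for tag in awayteamsTd]. Checking the length instead of dropping the first entry unconditionally should keep the code working if the header cell is ever missing.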