python - Pulling links in Python and scraping those pages

Question

I want to scrape some links from this page.

http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/teams/pastresults/2012/team665231.html

This gets the links I want.

boxurl = urllib2.urlopen(url).read()
soup = BeautifulSoup(boxurl)
boxscores = soup.findAll('a', href=re.compile('boxscore'))

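I want to scrape each boxscore linked from the page. I have already written the code to scrape a boxscore, but I don't know how to get to them.

Edit

I think this way is better, since it strips out the HTML tags. I still need to know how to open the links.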

for link in soup.find_all('a', href=re.compile('boxscore')):
    print(link.get('href'))

Edit 2: This is how I scrape some data from the first link on the page.

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?page=/data/wnba/results/2012/boxscore841602.html'


boxurl = urllib2.urlopen(url).read()
soup = BeautifulSoup(boxurl)
def _unpack(row, kind='td'):
    return [val.text for val in row.findAll(kind)]

tables = soup('table')
linescore = tables[1]   
linescore_rows = linescore.findAll('tr')
roadteamQ1 = float(_unpack(linescore_rows[1])[1])
roadteamQ2 = float(_unpack(linescore_rows[1])[2])
roadteamQ3 = float(_unpack(linescore_rows[1])[3])
roadteamQ4 = float(_unpack(linescore_rows[1])[4]) 

print roadteamQ1, roadteamQ2, roadteamQ3, roadteamQ4

However, when I try this:

url = 'http://www.covers.com/pageLoader/pageLoader.aspx?    page=/data/wnba/teams/pastresults/2012/team665231.html'
boxurl = urllib2.urlopen(url).read()
soup = BeautifulSoup(boxurl)

tables = pages[0]('table')
linescore = tables[1]   
linescore_rows = linescore.findAll('tr')
roadteamQ1 = float(_unpack(linescore_rows[1])[1])
roadteamQ2 = float(_unpack(linescore_rows[1])[2])
roadteamQ3 = float(_unpack(linescore_rows[1])[3])
roadteamQ4 = float(_unpack(linescore_rows[1])[4])

I get this error on the line tables = pages[0]('table'): TypeError: 'str' object is not callable

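print pages[0]

prints all the HTML of the first link, as expected. Hopefully this isn't too confusing. To sum up, I can now get the links, but I still can't scrape data from them.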

score 1 · Accepted Answer

Something like this pulls all of the pages for the links it finds into an array, so the first page is pages[0], the second is pages[1], and so on.

boxscores = soup.findAll('a', href=re.compile('boxscore'))
basepath =  "http://www.covers.com"
pages=[]
for a in boxscores:
   pages.append(urllib2.urlopen(basepath + a['href']).read())
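Note that each entry in pages is the raw HTML as a plain string, which is why pages[0]('table') raises "'str' object is not callable". Each string has to be wrapped in BeautifulSoup before it can be searched. Below is a minimal sketch of that step, assuming the bs4 package and its built-in 'html.parser' backend. The helper quarter_scores, its table_index and row_index parameters, and the sample_page markup are hypothetical stand-ins for illustration; the real boxscore pages may be laid out differently.

```python
from bs4 import BeautifulSoup


def unpack(row, kind='td'):
    """Return the text of every `kind` cell in a table row."""
    return [cell.get_text(strip=True) for cell in row.find_all(kind)]


def quarter_scores(html, table_index=1, row_index=1):
    """Parse one page's HTML string and return four quarter scores as floats.

    table_index and row_index are assumptions that mirror the question's
    tables[1] and linescore_rows[1].
    """
    soup = BeautifulSoup(html, 'html.parser')  # turn the str into a soup
    tables = soup('table')                     # shorthand for soup.find_all('table')
    rows = tables[table_index].find_all('tr')
    cells = unpack(rows[row_index])
    return [float(value) for value in cells[1:5]]


# Hypothetical stand-in for one downloaded boxscore page.
sample_page = """
<table><tr><td>Header table</td></tr></table>
<table>
  <tr><td>Team</td><td>1</td><td>2</td><td>3</td><td>4</td><td>T</td></tr>
  <tr><td>Road</td><td>20</td><td>18</td><td>22</td><td>15</td><td>75</td></tr>
  <tr><td>Home</td><td>19</td><td>21</td><td>17</td><td>23</td><td>80</td></tr>
</table>
"""

# In the real script, pages would be the list built by the loop above.
pages = [sample_page]
all_scores = [quarter_scores(page) for page in pages]
print(all_scores)
```

With the real pages list in place of the sample, the same list comprehension would yield one list of quarter scores per boxscore page, provided every page follows the layout assumed here.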
