多亏了beautifulsoup,我已经抓取了网络数据,但是我无法将输出转换为我可以操作的矩阵/数组。
from bs4 import BeautifulSoup
import urllib2
headers = { 'User-Agent' : 'Mozilla/5.0' }
req = urllib2.Request('http://statsheet.com/mcb/teams/duke/game_stats', None, headers)
html = urllib2.urlopen(req).read()
soup = BeautifulSoup(html)
#statdiv = soup.find('div', attrs={'id': 'basic_stats'}) #not needed
table = soup.find('table', attrs={'class': 'sortable statsb'})
rows = table.findAll('tr')
for tr in rows:
text = []
cols = tr.findAll('td')
for td in cols:
try:
text = ''.join(td.find(text=True))
except Exception:
text = "000"
print text+",",
print
注意:''.join(td.find(text=True))
是为了防止程序在空白单元格上失败。
输出:
W, GSU, 32, 42, 74, 24-47, 51.1, 15-23, 65.2, 11-24, 45.8, 6, 25, 31, 17, 4, 6, 15, 19,
W, UK, 33, 42, 75, 26-57, 45.6, 15-22, 68.2, 8-18, 44.4, 11, 20, 31, 16, 6, 6, 8, 17,
W, FGCU, 52, 36, 88, 30-63, 47.6, 19-23, 82.6, 9-31, 29.0, 16, 21, 37, 19, 9, 4, 18, 14,
W, @MINN, 40, 49, 89, 30-55, 54.5, 21-26, 80.8, 8-10, 80.0, 10, 22, 32, 12, 12, 4, 15, 21,
W, VCU, 29, 38, 67, 20-48, 41.7, 24-27, 88.9, 3-15, 20.0, 4, 30, 34, 14, 4, 8, 8, 18,
W, Lville, 36, 40, 76, 24-55, 43.6, 23-27, 85.2, 5-20, 25.0, 8, 25, 33, 13, 8, 6, 14, 20,
W, OSU, 23, 50, 73, 24-51, 47.1, 20-27, 74.1, 5-12, 41.7, 8, 29, 37, 11, 3, 5, 8, 19,
这是完美的,只是现在我无法弄清楚如何将数据放入矩阵中,以便我可以操作某些列,添加新列等。
我一直在玩 numpy 但每次我尝试都会得到这样的结果:
[u'W,']
[u'GSU,']
[u'32,']
[u'42,']
[u'74,']
[u'24-47,']
[u'51.1,']
[u'15-23,']
[u'65.2,']
[u'11-24,']
[u'45.8,']
我想要的是获取我抓取的数据并能够添加列、移动列、更改列中的文本、将一列中的数据拆分为两列(连字符列)。
这是我使用python的第二天。我假设将我的数据放入矩阵/数组中是最简单的方法。如果不是,请告诉我。