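I am trying to get a table from a link into an Excel file in the same form as it appears on the page. I am using the code below to get the table: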
import csv
import urllib2

# Assumption: BeautifulSoup 4; with BeautifulSoup 3 use "from BeautifulSoup import BeautifulSoup"
from bs4 import BeautifulSoup

# Get a particular table from the page and write it to a CSV file for Excel
page = urllib2.urlopen('http://developer.android.com/about/dashboards/index.html').read()
soup = BeautifulSoup(page)
a = soup('div', {'class': 'col-5'})[0]
with open('android version 2013_01_18.csv', 'wb') as csvfile:
    csvout = csv.writer(csvfile, delimiter=',')
    csvout.writerow(["Version", "Codename", "API", "Distribution"])
    for table in a.findAll('table'):
        print '#'
        print '# Table'
        print '# Fields: ' + ','.join([th.text for th in table.findAll('th')])
        # One CSV row per <tr>, one field per <td>
        for row in table.findAll('tr'):
            csvout.writerow([td.text for td in row.findAll('td')])
This is the output I get in Excel:
1.6            Donut               4       0.20%
2.1            Eclair              7       2.40%
2.2            Froyo               8       9.00%
"2.3 - 2.3.2
"              Gingerbread         9       0.20%
"2.3.3 - 2.3.7
"              10                  47.40%
3.1            Honeycomb           12      0.40%
3.2            13                  1.10%
4.0.3 - 4.0.4  Ice Cream Sandwich  15      29.10%
4.1            Jelly Bean          16      9.00%
4.2            17                  1.20%
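The problem is with the rows that come immediately after a merged cell: their td count is 3 instead of 4, so the remaining values shift one column to the left. I found that the merged cell is created in the HTML with rowspan=2, but I would like to know how to use this information to get the data exactly as it appears on the page. Below is the HTML structure: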
<tr>
<td>
<a href="/about/versions/android-2.3.html">2.3 - 2.3.2</a>
</td>
<td rowspan="2">Gingerbread</td>
<td>9</td>
<td>0.2%</td>
</tr>
<tr>
<td>
<a href="/about/versions/android-2.3.3.html">2.3.3 - 2.3.7 </a>
</td>
<td>10</td>
<td>47.4%</td>
</tr>
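
One possible way to use the rowspan information is to remember, per column, a cell that still spans into the following rows and re-insert its text when those rows are processed. The sketch below is only an illustration under some assumptions: the function name expand_rowspans and the (text, rowspan) input format are hypothetical, colspan is not handled, and the spanned value is simply repeated because a CSV file cannot represent a merged cell. It runs under both Python 2 and Python 3.

```python
def expand_rowspans(rows):
    """Return rows in which cells hidden by a rowspan are filled in.

    `rows` is a list of rows; each row is a list of (text, rowspan)
    pairs in the order the <td> elements appear in the HTML.
    (Hypothetical helper, not part of BeautifulSoup or csv.)
    """
    pending = {}  # column index -> (text, number of rows still to fill)
    result = []
    for row in rows:
        cells = iter(row)
        out = []
        col = 0
        while True:
            if col in pending:
                # This column is covered by a cell from an earlier row.
                text, remaining = pending[col]
                out.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col] = (text, remaining - 1)
            else:
                cell = next(cells, None)
                if cell is None:
                    break  # no more <td> cells in this row
                text, rowspan = cell
                out.append(text)
                if rowspan > 1:
                    pending[col] = (text, rowspan - 1)
            col += 1
        result.append(out)
    return result


# The two Gingerbread rows from the HTML above, as (text, rowspan) pairs.
sample = [
    [("2.3 - 2.3.2", 1), ("Gingerbread", 2), ("9", 1), ("0.2%", 1)],
    [("2.3.3 - 2.3.7", 1), ("10", 1), ("47.4%", 1)],
]
for fixed_row in expand_rowspans(sample):
    print(fixed_row)
# Second row printed: ['2.3.3 - 2.3.7', 'Gingerbread', '10', '47.4%']
```

With the code from the question, the pairs for each row could presumably be built as [(td.text.strip(), int(td.get('rowspan', 1))) for td in row.findAll('td')], and each expanded row passed to csvout.writerow. I have not run that against the live page. The strip() call should also remove the stray newlines that make Excel show the quoted two-line version cells.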