
A small problem for you :-)

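I'm using BeautifulSoup to parse the contents of a table in an HTML page. The problem is that my output file (CSV/Excel) ends up with a blank row between every row of data. Here is a sample of the HTML table (the real one is very large):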

<tr><td class="normaltext" valign="TOP">Tesco - United Kingdom&nbsp;&nbsp;</td>
<td class="normaltext"  valign="TOP">CO</td>
<td class="normaltext"  valign="TOP">Unknown&nbsp;&nbsp;</td>
<td class="normaltext"  align="center" valign="top">lol</td></tr>
<tr><td colspan="5"><hr></td></tr>
<tr><td class="normaltext" valign="TOP">Tesco - United Kingdom&nbsp;&nbsp;</td>
<td class="normaltext"  valign="TOP">CO</td>
<td class="normaltext"  valign="TOP">Unknown&nbsp;&nbsp;</td>
<td class="normaltext"  align="center" valign="top">lol</td></tr>
<tr><td colspan="5"><hr></td></tr>

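After every data <tr> there is this row: <tr><td colspan="5"><hr></td></tr>, so a blank row ends up in my CSV/Excel sheet. I want to extract all the information into the Excel sheet, but without a blank row between each row of data.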

Here is the script I'm using:

rows = tableau[3].findAll('tr')
for tr in rows:
    cols = tr.findAll('td', attrs={'class' : 'normaltext'})
    y = 0
    x = x + 1
    for td in cols:
        texte_bu = td.text
        texte_bu = texte_bu.encode('utf-8')
        texte_bu = texte_bu.strip()
        ws.write(x,y,td.text)
        y = y + 1

Many thanks to anyone who can give me a hint on how to get rid of this useless blank row between each row of my output file :)

1 Answer

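Solution: when you hit an empty row (a <tr> with no normaltext cells), skip the rest of the loop body and move on to the next row. That way no blank row gets written to your workbook. :)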

Here is a working mock-up. I also added a cosmetic tweak so that no blank row is emitted at the top. Hope this gets rid of your blank-row trouble :)

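# BeautifulSoup 4 (package "beautifulsoup4"); the old "from BeautifulSoup import ..." form is BeautifulSoup 3 and Python 2 only
from bs4 import BeautifulSoup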
import xlwt

text = '''<table><tr><td class="normaltext" valign="TOP">Tesco - United Kingdom&nbsp;&nbsp;</td>
<td class="normaltext"  valign="TOP">CO</td>
<td class="normaltext"  valign="TOP">Unknown&nbsp;&nbsp;</td>
<td class="normaltext"  align="center" valign="top">BULATS</td></tr>
<tr><td colspan="5"><hr></td></tr>
<tr><td class="normaltext" valign="TOP">Tesco - United Kingdom&nbsp;&nbsp;</td>
<td class="normaltext"  valign="TOP">CO</td>
<td class="normaltext"  valign="TOP">Unknown&nbsp;&nbsp;</td>
<td class="normaltext"  align="center" valign="top">BULATS</td></tr>
<tr><td colspan="5"><hr></td></tr></table>'''

wb = xlwt.Workbook()
ws = wb.add_sheet('a test sheet')

soup = BeautifulSoup(text)
table = soup.find('table')
rows = table.findAll('tr')
x = 0
for tr in rows:
    cols = tr.findAll('td', attrs={'class' : 'normaltext'})
    if not cols: 
        # when we hit an empty row, we should not print anything to the workbook
        continue
    y = 0
    for td in cols:
        # strip surrounding whitespace and write the cleaned text, not the raw td.text
        texte_bu = td.text.strip()
        ws.write(x, y, texte_bu)
        print(x, y, texte_bu)
        y = y + 1
    # update the row pointer AFTER a row has been printed
    # this avoids the blank row at the top of your table
    x = x + 1

wb.save('example.xls')
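
Since the question also mentions CSV output, here is a minimal sketch of the same "skip the separator rows" idea for CSV. It is not the code from the question: to stay self-contained it swaps BeautifulSoup for the standard library's html.parser, and the names NormalTextRows and html_rows_to_csv are made up for this illustration. It assumes, as in the sample, that data cells carry class="normaltext" and separator rows do not.

```python
import csv
import io
from html.parser import HTMLParser


class NormalTextRows(HTMLParser):
    """Collect the text of <td class="normaltext"> cells, grouped by <tr>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # &nbsp; arrives as '\xa0'
        self.rows = []     # finished rows that had at least one data cell
        self._row = None   # cells of the <tr> currently being read
        self._cell = None  # text chunks of the data <td> currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            if dict(attrs).get("class") == "normaltext":
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            # str.strip() also removes the non-breaking spaces
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:  # separator rows have no data cells: skip them
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_rows_to_csv(html):
    """Return CSV text with one line per data row and no blank lines."""
    parser = NormalTextRows()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


sample = '''<table><tr><td class="normaltext" valign="TOP">Tesco - United Kingdom&nbsp;&nbsp;</td>
<td class="normaltext"  valign="TOP">CO</td>
<td class="normaltext"  valign="TOP">Unknown&nbsp;&nbsp;</td>
<td class="normaltext"  align="center" valign="top">BULATS</td></tr>
<tr><td colspan="5"><hr></td></tr>
<tr><td class="normaltext" valign="TOP">Tesco - United Kingdom&nbsp;&nbsp;</td>
<td class="normaltext"  valign="TOP">CO</td>
<td class="normaltext"  valign="TOP">Unknown&nbsp;&nbsp;</td>
<td class="normaltext"  align="center" valign="top">BULATS</td></tr>
<tr><td colspan="5"><hr></td></tr></table>'''

print(html_rows_to_csv(sample))
```

If you write to a real file instead of a StringIO, the csv module documentation recommends opening it with newline=''; without that, extra blank lines can show up between rows on Windows, which is a separate cause of the same symptom.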
answered 2012-05-01T19:57:50.477