python - Confused about reading HTML table content with BeautifulSoup?

Question

Here is the HTML content:

<table cellspacing="1" cellpadding="0" class="data">
<tr class="colhead">
            <th colspan="3">Expression</th>
        </tr>
        <tr class="colhead">
            <th>Task</th>
            <th>Action</th>
            <th>List</th>
</tr>           
<tr class="rowLight">
    <td width="40%">
            Task1
        </td>
        <td width="20%">
             Assigned to 
        </td>
        <td width="40%">
             Harry
    </td>

</tr>           
<tr class="rowDark">
     <td width="40%">
                    Task2
                </td>
                <td width="20%">
                     Rejected by 
                </td>
                <td width="40%">
                    Lopa 
                </td>
</tr>

<tr class="rowLight">
    <td width="40%">
            Task5
        </td>
        <td width="20%">
             Accepted By 
        </td>
        <td width="40%">
            Mathew
        </td>
</tr>

Now I need to extract the following values. (The table below is just a mock-up of the Excel sheet I will build once I have the values.)

Task    Action        List
Task1   Assigned to   Harry
Task2   Rejected by   Lopa
Task5   Accepted By   Mathew

The naive approach I know of looks like this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(source_URL)

alltables = soup.findAll( "table", {"border":"2", "width":"100%"} )

t = [x for x in soup.findAll('td')]

[x.renderContents().strip('\n') for x in t]

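But the HTML above has no such structure (the table carries no border or width attributes to match on), so how should I handle it? Please point me in the right direction!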

score 2 · Accepted Answer

Use .stripped_strings to get the "interesting" text out of the table rows:

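# Locate the table by its class, then keep only the data rows.
table = soup.find('table', class_='data')
rows = table.find_all('tr', class_=['rowLight', 'rowDark'])
for row in rows:
    # stripped_strings yields each text node with surrounding whitespace removed
    print(list(row.stripped_strings))

This outputs:

['Task1', 'Assigned to', 'Harry']
['Task2', 'Rejected by', 'Lopa']
['Task5', 'Accepted By', 'Mathew']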
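For anyone who wants to try this without the original page, here is a minimal, self-contained sketch. It assumes a condensed copy of the table from the question, the built-in "html.parser" backend, and a helper name (extract_rows) that is made up for illustration; none of these come from the original answer.

```python
from bs4 import BeautifulSoup

# Condensed copy of the question's table (assumption: same class names).
HTML = """
<table cellspacing="1" cellpadding="0" class="data">
<tr class="colhead"><th colspan="3">Expression</th></tr>
<tr class="colhead"><th>Task</th><th>Action</th><th>List</th></tr>
<tr class="rowLight">
    <td width="40%"> Task1 </td>
    <td width="20%"> Assigned to </td>
    <td width="40%"> Harry </td>
</tr>
<tr class="rowDark">
    <td width="40%"> Task2 </td>
    <td width="20%"> Rejected by </td>
    <td width="40%"> Lopa </td>
</tr>
<tr class="rowLight">
    <td width="40%"> Task5 </td>
    <td width="20%"> Accepted By </td>
    <td width="40%"> Mathew </td>
</tr>
</table>
"""


def extract_rows(html):
    """Return the stripped cell text of every rowLight/rowDark row.

    extract_rows is a hypothetical helper name used only in this sketch.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="data")
    # A list passed to class_ matches rows carrying either class.
    rows = table.find_all("tr", class_=["rowLight", "rowDark"])
    return [list(row.stripped_strings) for row in rows]


for cells in extract_rows(HTML):
    print(cells)
```

Selecting rows by class skips the two colhead rows, so only the data rows come back.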

Or pull everything into a list of lists (excluding the last row, as requested):

data = [list(r.stripped_strings) for r in rows[:-1]]

which gives:

data = [['Task1', 'Assigned to', 'Harry'], ['Task2', 'Rejected by', 'Lopa']]

The result of .find_all(), a ResultSet, behaves like a Python list, so you can, for example, slice it however you like to skip certain rows.
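Since the question's end goal is a spreadsheet, here is a rough sketch of how slicing could separate the header row from the data rows and produce CSV text that Excel can open. The shortened table, the use of the standard csv module, and the in-memory buffer are assumptions of this sketch, not part of the original answer.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the question's table (assumption: same layout).
HTML = """
<table class="data">
<tr class="colhead"><th colspan="3">Expression</th></tr>
<tr class="colhead"><th>Task</th><th>Action</th><th>List</th></tr>
<tr class="rowLight"><td> Task1 </td><td> Assigned to </td><td> Harry </td></tr>
<tr class="rowDark"><td> Task2 </td><td> Rejected by </td><td> Lopa </td></tr>
<tr class="rowLight"><td> Task5 </td><td> Accepted By </td><td> Mathew </td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
all_rows = soup.find("table", class_="data").find_all("tr")

# all_rows[0] is the "Expression" banner; all_rows[1] holds the column names.
header = list(all_rows[1].stripped_strings)

# Slice the ResultSet like a list: skip the two header rows.
body = [list(r.stripped_strings) for r in all_rows[2:]]

# The same idea with the last row dropped as well.
without_last = [list(r.stripped_strings) for r in all_rows[2:-1]]

# Write to an in-memory buffer; a real script would open a .csv file instead.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(header)
writer.writerows(body)
csv_text = buffer.getvalue()

print(csv_text)
```

Indexing by position works here because the table has exactly two header rows; if the layout varies, filtering by class is the safer choice.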
