python - How do I convert a BeautifulSoup document into a "dict" to save?

Question

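I have two tables with the same structure. Each table has only a class attribute, and the tr and td elements have no attributes at all.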

<table class='content'>
  <caption>
     <em> table1 </em>
  </caption>
  <tbody>
     <tr>
       <th> A </th>
       <th> B </th>
       <th> C </th>
     </tr>
     <tr>
       <td> a1 <td>
       <td> b1 <td>
       <td> c1 <td>
     </tr>
     <tr>
       <td> a2 <td>
       <td> b2 <td>
       <td> c2 <td>
     </tr>
   </tbody>
</table>

<table class='content'>
  <caption>
     <em> table2 </em>
  </caption>
  <tbody>
     <tr>
       <th> A </th>
       <th> B </th>
       <th> C </th>
     </tr>
     <tr>
       <td> a3 <td>
       <td> b3 <td>
       <td> c3 <td>
     </tr>
     <tr>
       <td> a4 <td>
       <td> b4 <td>
       <td> c4 <td>
     </tr>
   </tbody>
</table>

I would like to end up with a dict like this:

{table1:[ {A:[a1,a2]}, {B:[b1,b2]}, {C:[c1,c2]} ], table2:[ {A:[a3,a4]}, {B:[b3,b4]}, {C:[c3,c4]} ], }

Can anyone help me get this dict, or something similar?

score 1 · Accepted Answer

Try this (also note that you have <td>...<td> where you need <td>...</td>):

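import bs4

your_html = """..."""
# Name the parser explicitly so bs4 does not warn or pick a different one per machine.
soup = bs4.BeautifulSoup(your_html, "html.parser")
big_dict = {}

for table in soup.find_all("table"):
    # The caption text (inside <em>) is the key for this table.
    key = table.find("em").get_text().strip()
    big_dict[key] = []
    headers = []
    # Build one {header: []} dict per column, in header order.
    for th in table.find_all("th"):
        headers.append(th.get_text().strip())
        big_dict[key].append({headers[-1]: []})
    # Append each cell to the column at the same index (the header row has no <td>).
    for row in table.find_all("tr"):
        for i, cell in enumerate(row.find_all("td")):
            big_dict[key][i][headers[i]].append(cell.get_text().strip())

print(big_dict)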

With the <td> tags closed properly, the above gives me:

{'table1': [{'A': ['a1', 'a2']}, {'B': ['b1', 'b2']}, {'C': ['c1', 'c2']}], 'table2': [{'A': ['a3', 'a4']}, {'B': ['b3', 'b4']}, {'C': ['c3', 'c4']}]}
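
If the HTML cannot be corrected at its source, the unclosed cells could be repaired before parsing. The following is only a minimal sketch: the helper name `close_td_tags` and the sample string are hypothetical, and it assumes every cell follows the `<td> text <td>` pattern from the question, with no nested tags inside a cell.

```python
import re

# Hypothetical sample reproducing the malformed cells from the question.
broken_html = """<tr>
  <td> a1 <td>
  <td> b1 <td>
  <td> c1 <td>
</tr>"""


def close_td_tags(html):
    """Rewrite each '<td> text <td>' pair as '<td> text </td>'.

    Assumes a cell holds plain text only (no '<' inside the cell).
    """
    return re.sub(r"<td>([^<]*)<td>", r"<td>\1</td>", html)


fixed_html = close_td_tags(broken_html)
print(fixed_html)
```

The repaired string can then be passed to BeautifulSoup in the same way as `your_html` above.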

score 0 · Answer

What you are looking for is the table's row data, mapped to the table headers, which are in turn keyed by the table's caption.

{
    table[0].caption: {
        th[n] : [
          col[n][0],
          col[n][1],
          col[n][2]]
    }
}

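So you need to break the task down into:

Get the table's caption
Get the table headers
Loop over each row of the table, storing each td under the column that matches its index.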

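Rather than writing the code for you, I can point you toward the documentation on searching an HTML document.

Please ask a more specific question next time, and we will be able to give you a more direct answer.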
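
The three steps listed in this answer can be sketched roughly as follows. This is an illustration rather than the answerer's own code: the function name `tables_to_dict` and the sample HTML are hypothetical, the `<td>` tags are assumed to be closed properly, and the result uses the nested-dict shape outlined in this answer rather than the list of single-key dicts from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample with properly closed <td> tags.
sample_html = """
<table class='content'>
  <caption><em> table1 </em></caption>
  <tbody>
    <tr><th> A </th><th> B </th></tr>
    <tr><td> a1 </td><td> b1 </td></tr>
    <tr><td> a2 </td><td> b2 </td></tr>
  </tbody>
</table>
"""


def tables_to_dict(html):
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for table in soup.find_all("table"):
        # Step 1: the caption text becomes the outer key.
        caption = table.find("caption").get_text(strip=True)
        # Step 2: the header cells become the inner keys.
        headers = [th.get_text(strip=True) for th in table.find_all("th")]
        columns = {header: [] for header in headers}
        # Step 3: file each td under the header at the same index.
        for row in table.find_all("tr"):
            for header, cell in zip(headers, row.find_all("td")):
                columns[header].append(cell.get_text(strip=True))
        result[caption] = columns
    return result


print(tables_to_dict(sample_html))
# {'table1': {'A': ['a1', 'a2'], 'B': ['b1', 'b2']}}
```

Using `zip` instead of indexing means a row with more cells than headers is truncated silently rather than raising an IndexError; whether that is desirable depends on how clean the input is.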
