python - Parsing rows of information from HTML using Python

Question

I am trying to find a way to parse the following information:

<tr>
   <td class="prodSpecAtribute">Rulebook Chapter</td>
   <td colspan="5">
     <a href="http://cmegroup.com/rulebook/CME/V/450/452/452.pdf" target="_blank" title="CME Chapter 452">CME Chapter 452</a>
   </td>
</tr>

<tr>
   <td class="prodSpecAtribute" rowspan="2">
      Trading Hours
      <br>
      (All times listed are Central Time)
   </td>
   <td>OPEN OUTCRY</td>
   <td colspan="4">
      <div class="font_black Large_div_td">MON-FRI: 7:20 a.m. - 2:00 p.m.</div>
   </td>
</tr>
<tr>
   <td>CME GLOBEX</td>  # PROBLEM HERE -- I want this cell and the div below to be treated as one row, filed under the <td class="prodSpecAtribute" rowspan="2"> ... Trading Hours ... cell above

   <td colspan="4">
      <div class="font_black Large_div_td">SUN - FRI: 5:00 p.m. - 4:00 p.m. CT</div>
   </td>
</tr>

I can easily parse the information in the top part of the table, like this:

from bs4 import BeautifulSoup

soup = BeautifulSoup(page, 'html.parser')
left_col = soup.findAll('td', attrs={'class': 'prodSpecAtribute'})
right_col = soup.findAll('td', colspan=['4', '5'])

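So in this example there are 3 rows: 2 of them have a cell with class "prodSpecAtribute", and each of those has at least one column that corresponds to it. The last row, however, has no such class, so I need a way to reuse the previous class and file this row under it, together with the third row's two <td> cells: CME GLOBEX and SUN - FRI: 5:00 p.m. - 4:00 p.m. CT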

The combine_col method:

def combine_col(right):
    # Collect the text of every cell instead of keeping only the last one.
    texts = []
    for cell in right:
        text_ = ' '.join(cell.findAll(text=True))
        print(text_)
        texts.append(text_)

    return ' '.join(texts)

score 1 · Accepted Answer

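The obvious way to merge the second and third columns of the second row is to iterate over the rows explicitly. Anything you write with find_all will just return row0-col1, row1-col1, and row1-col2 as three separate values, and you will have no way of knowing which ones belong together.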

So, if I understand your problem, you want something like this:

left_col = []
right_col = []
for tr in soup.find_all('tr'):
    tds = tr.find_all('td')
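    # Column 0 is treated as the label; everything after it is the value.
    left, right = tds[0], tds[1:]
    # Sanity check: raises AssertionError if the first cell lacks the class.
    assert 'prodSpecAtribute' in left.get('class', [])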
    left_col.append(left)
    right_col.append(combine_columns(right))

Except that you need to write that combine_columns code yourself, because I don't know how you want to "combine the information" in the columns.
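Purely as an illustration, one possible combine_columns is sketched below. It assumes that "combining" means joining the visible text of each right-hand cell with a separator; the separator and the get_text arguments are assumptions, not something the question specifies.

```python
def combine_columns(cells, sep=' | '):
    """Join the visible text of each cell into one string.

    `cells` is expected to be a list of BeautifulSoup Tag objects (anything
    with a compatible get_text() method works). The separator is an
    arbitrary choice made for this sketch.
    """
    # get_text(' ', strip=True) joins nested strings with a space and trims
    # the whitespace and newlines that surround them in the HTML source.
    texts = [cell.get_text(' ', strip=True) for cell in cells]
    # Skip cells that contained only whitespace.
    return sep.join(t for t in texts if t)
```

Called on the two cells holding CME GLOBEX and the SUN - FRI div, this should give a single string with the two values separated by " | ".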

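I'm obviously using the rule that column 0 is the left-hand one, rather than whichever column has the class prodSpecAtribute. I did this mainly because I couldn't work out what you want to happen for rows that have no such column, or where it isn't the leftmost column. So I just added an assert as a sanity check, to verify that this is always the right rule for your source.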
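If rows without a prodSpecAtribute cell should instead be filed under the most recent labelled row (one reading of rowspan="2"), a minimal sketch could carry the last label forward rather than asserting. The function name collect_rows and the output shape (a list of label/value pairs) are hypothetical choices for illustration only.

```python
def collect_rows(soup, label_class='prodSpecAtribute'):
    """Return (label, value) pairs, reusing the last label for rows that lack one.

    `soup` is assumed to be a BeautifulSoup object (or Tag) that contains
    the <tr> rows.
    """
    pairs = []
    last_label = None
    for tr in soup.find_all('tr'):
        tds = tr.find_all('td')
        if not tds:
            continue  # e.g. a header row made only of <th> cells
        # In bs4 the 'class' attribute is a list; plain <td> cells have none.
        if label_class in (tds[0].get('class') or []):
            last_label = tds[0].get_text(' ', strip=True)
            cells = tds[1:]
        else:
            # Continuation row under a rowspan: every cell is a value.
            cells = tds
        value = ' '.join(c.get_text(' ', strip=True) for c in cells)
        pairs.append((last_label, value))
    return pairs
```

For the HTML in the question this should yield three pairs, with the OPEN OUTCRY and CME GLOBEX rows sharing the Trading Hours label. A row that appears before any labelled row gets None as its label.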
