python - Parsing HTML elements with BeautifulSoup

Question

Suppose I have:

<tr>
   <td class="prodSpecAtribute">word</td>
   <td colspan="5">
      <a href="http://www.cmegroup.com/clearing/trading-practices/CMEblock-trade.html" target="_blank">another_word</a>
   </td>
</tr>

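I want to extract the text in the two td cells (word and another_word), so I used BeautifulSoup.

Here is the code Matijn Pieters asked for. Basically, it reads the information from an HTML page (from a table) and stores the values in left-column and right-column lists. I then build a dictionary from those lists, using the left-column list as the keys and the right-column list as the values.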

import re
from bs4 import BeautifulSoup  # assumes BeautifulSoup 4 (the bs4 package)

def check(td):
    # check() was not shown in the question; it is assumed here to test
    # whether the cell carries the 'prodSpecAtribute' class
    return 'prodSpecAtribute' in td.get('class', [])

def get_data(page):

    soup = BeautifulSoup(page, 'html.parser')

    left = []
    right = []

    # Obtain data from the table and store it in the left and right columns
    # Iterate through each row
    for tr in soup.find_all('tr'):

        # Find all table data cells (cols) in that row
        tds = tr.find_all('td')

        # Make sure the row has at least 2 cells: a label and a value
        if len(tds) >= 2:

            # Find each entry in the row -> convert to text
            inp = []
            once = 0
            no_class = 0
            for td in tds:
                if once == 0:
                    # The first cell must be of class 'prodSpecAtribute'
                    if check(td):
                        left_col = td.find_all(string=True)
                        # Strip leftover HTML entities such as &nbsp;
                        left_col_x = re.sub(r'&\w+;', '', str(left_col[0]))
                        once = 1
                    else:
                        no_class = 1
                        break

                else:
                    # Every later cell contributes to the right-hand value
                    right_col = td.find_all(string=True)
                    right_col_x = ' '.join(text for text in right_col if text.strip())
                    right_col_x = re.sub(r'&\w+;', '', right_col_x)
                    inp.append(right_col_x)

            if no_class == 0:
                inps = '. '.join(inp)
                left.append(left_col_x)
                right.append(inps)

    # Create a dictionary from the left and right columns
    item = dict(zip(left, right))
    return item
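
The last step, pairing the left-column list with the right-column list, can be looked at in isolation. The sketch below uses made-up placeholder labels and values (they are not taken from the real page) and shows one thing worth knowing about dict(zip(...)): if the same label appears in more than one row, the later value silently replaces the earlier one.

```python
# Placeholder data, purely illustrative; not taken from the real page
left = ["label_a", "label_b", "label_a"]
right = ["value 1", "value 2", "value 3"]

# zip() pairs the lists element by element; dict() keeps the last value per key
item = dict(zip(left, right))
print(item)  # {'label_a': 'value 3', 'label_b': 'value 2'}
```

If repeated labels can occur in the table, collecting the values into a list per key may be a safer choice than a plain dict.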

score 1 · Accepted Answer

You can use HTQL ( http://htql.net ).

Here it is applied to your example:

import htql
page="""
   <tr>
      <td class="prodSpecAtribute">word</td>
      <td colspan="5">
          <a href="http://www.cmegroup.com/clearing/trading-practices/CMEblock-trade.html" target="_blank">another_word</a>
      </td>
   </tr>
   """

query = """
   <tr>{ 
      c1 = <td (class='prodSpecAtribute')>1 &tx;
      c2 = <td>2 &tx &trim;
   }
   """ 

a=htql.query(page, query)
print(dict(a))

It prints:

{'word': 'another_word'}
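
For comparison, the same result can probably be reached with BeautifulSoup alone. This is only a minimal sketch, assuming BeautifulSoup 4 (the bs4 package) with the built-in html.parser backend; the function name extract_pairs is hypothetical and not part of the question or of any library.

```python
from bs4 import BeautifulSoup

page = """
<tr>
   <td class="prodSpecAtribute">word</td>
   <td colspan="5">
      <a href="http://www.cmegroup.com/clearing/trading-practices/CMEblock-trade.html" target="_blank">another_word</a>
   </td>
</tr>
"""

def extract_pairs(page):
    """Map each 'prodSpecAtribute' label cell to the text of the cells after it."""
    soup = BeautifulSoup(page, "html.parser")
    item = {}
    for tr in soup.find_all("tr"):
        tds = tr.find_all("td")
        # Skip rows that lack a label cell followed by at least one value cell
        if len(tds) < 2 or "prodSpecAtribute" not in tds[0].get("class", []):
            continue
        # get_text(" ", strip=True) joins the stripped text fragments of a cell
        key = tds[0].get_text(" ", strip=True)
        values = [td.get_text(" ", strip=True) for td in tds[1:]]
        item[key] = ". ".join(v for v in values if v)
    return item

result = extract_pairs(page)
print(result)  # {'word': 'another_word'}
```

Because get_text() returns already-decoded text, the manual re.sub() pass for entities is likely unnecessary here.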
