python - Detecting numbers in an HTML file using Python 3.2

Question

I have an HTML file that I want to parse using Python 3.2. Example:

<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>
<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>

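The task is to detect the numbers that appear as plain text between the tags (in this example, only 15) and store them in another text file. I can't decide which HTML parser to use (lxml, Beautiful Soup) because I'm new to this. Could you guide me on how to approach this problem? Thanks in advance!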

score 0 · Accepted Answer

BeautifulSoup can do this easily. You can use the find_all method to find the elements and then process them:

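from bs4 import BeautifulSoup

# html_doc holds the HTML source as a string
soup = BeautifulSoup(html_doc, "html.parser")
tds = soup.find_all("td", "ln")  # every <td> whose class is "ln"
for td in tds:
    pass  # do something here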
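
For comparison, here is a minimal sketch of the same task using only the standard library's html.parser module instead of Beautiful Soup, in case installing a third-party package is not an option. The names LineNumberCollector, extract_numbers and save_numbers are made up for this illustration, and the sketch assumes the numbers of interest always sit directly inside <td class="ln"> cells, as in the sample from the question.

```python
from html.parser import HTMLParser


class LineNumberCollector(HTMLParser):
    """Collect whole numbers found inside <td class="ln"> cells."""

    def __init__(self):
        super().__init__()
        self.in_ln_cell = False  # True while inside a <td class="ln">
        self.numbers = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            classes = (dict(attrs).get("class") or "").split()
            self.in_ln_cell = "ln" in classes

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_ln_cell = False

    def handle_data(self, data):
        text = data.strip()
        # Keep only cell text made up entirely of decimal digits.
        if self.in_ln_cell and text.isdecimal():
            self.numbers.append(int(text))


def extract_numbers(html_text):
    """Return the numbers from all <td class="ln"> cells, in document order."""
    parser = LineNumberCollector()
    parser.feed(html_text)
    parser.close()
    return parser.numbers


def save_numbers(numbers, path):
    """Write one number per line to the text file at path."""
    with open(path, "w") as out:
        for n in numbers:
            out.write(str(n) + "\n")


sample = (
    '<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>\n'
    '<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>\n'
)
numbers = extract_numbers(sample)
```

Calling save_numbers(numbers, "numbers.txt") would then store the result; the file name is only a placeholder.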

score 0 · Accepted Answer

You could try something like this.

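from bs4 import BeautifulSoup

def getPrintUnicode(soup):
    # Recursively collect the text content of a parsed node.
    body = ''
    if isinstance(soup, str):
        # Text node. bs4 decodes entities while parsing, so these
        # replacements only affect entity text that survived parsing.
        soup = soup.replace('&#39;', "'")
        soup = soup.replace('&quot;', '"')
        soup = soup.replace('&nbsp;', ' ')
        soup = soup.replace('&gt;', '>')
        soup = soup.replace('&lt;', '<')
        body = body + soup
    else:
        # Tag or whole document: concatenate the text of every child.
        if not soup.contents:
            return ''
        con_list = soup.contents
        for con in con_list:
            body = body + getPrintUnicode(con)
    return body

print(getPrintUnicode(BeautifulSoup('<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>', 'html.parser')))

You can call getPrintUnicode() on the soup of the whole page; it returns the complete text content. Convert the resulting string to an integer, and use exception handling in case the text is not a number. For example:

print(int(getPrintUnicode(BeautifulSoup('<td class="ln">15</td><td class="sf3b2"><code>&nbsp;</code></td>', 'html.parser'))))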
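
The advice about exceptions could look roughly like the sketch below. The helper name to_int_or_none is hypothetical and not part of the answer's code; it only shows one way to guard the int() conversion so that non-numeric text does not crash the script.

```python
def to_int_or_none(text):
    """Return text converted to an int, or None if it is not a whole number."""
    try:
        # strip() removes surrounding whitespace, including the
        # non-breaking space (U+00A0) that &nbsp; decodes to.
        return int(text.strip())
    except ValueError:
        return None


print(to_int_or_none("15\xa0"))
print(to_int_or_none("not a number"))
```

Returning None rather than raising lets the caller skip cells that hold no number and keep processing the rest of the page.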
