python - Removing specific html tags with python

Question

I have some HTML tables inside of an HTML cell, like so:

miniTable = '''<table style="width: 100%%" bgcolor="%s">
               <tr><td><font color="%s"><b>%s</b></td></tr>
           </table>''' % (bgcolor, fontColor, floatNumber)

html += '<td>' + miniTable + '</td>'

Is there a way to remove the HTML tags that pertain to this minitable, and only these html tags?
I would like to somehow remove these tags:

<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>
and
</b></td></tr></table>

to get this:

floatNumber

where floatNumber is the string representation of a floating point number. I don't want any of the other HTML tags to be modified in any way. I was thinking of using string.replace or regex, but I'm stumped.

score 2 · Accepted Answer

If you can't install and use Beautiful Soup (otherwise BS is preferred, as @otto-allmendinger suggests):

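import re
# Sample input: a single mini table wrapping the number 1.23
s = '<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>1.23</b></td></tr></table>'
# Delete the table, tr, td, font and b tags, then convert what is left to a float
result = float(re.sub(r"</?table[^>]*>|</?t[rd]>|<font[^>]+>|</?b>", "", s))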
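Note that the substitution above deletes every table, tr, td, font and b tag in whatever string it is given, so it is only safe on the mini table by itself. If the goal is to clean a whole page while leaving the other tags alone, one possible variation is to match the complete wrapper and keep only the captured text. This is a sketch, not a general HTML solution: it assumes the mini tables look exactly like the template in the question (after formatting, `100%%` becomes `100%`), and the names `MINI_TABLE` and `strip_mini_tables` are made up for illustration.

```python
import re

# Matches exactly the wrapper built in the question. The attribute values
# (bgcolor, font color) may vary; the sequence of tags may not.
MINI_TABLE = re.compile(
    r'<table style="width: 100%" bgcolor="[^"]*">\s*'
    r'<tr><td><font color="[^"]*"><b>([^<]*)</b></td></tr>\s*'
    r'</table>'
)


def strip_mini_tables(html):
    """Replace each mini table with the text it wraps; leave everything else."""
    return MINI_TABLE.sub(lambda m: m.group(1), html)


# Build a mini table the same way the question does, inside an outer table.
mini = '''<table style="width: 100%%" bgcolor="%s">
    <tr><td><font color="%s"><b>%s</b></td></tr>
</table>''' % ("#eee", "red", 1.23)
page = '<table><tr><td><b>total</b></td><td>' + mini + '</td></tr></table>'

print(strip_mini_tables(page))
# prints: <table><tr><td><b>total</b></td><td>1.23</td></tr></table>
```

The outer `<b>total</b>` survives because the pattern only matches the full mini table wrapper, not individual tags.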

score 2 · Accepted Answer

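Don't use str.replace or regular expressions.

Use an HTML parsing library such as Beautiful Soup to get the elements you need and the text they contain.

The final code should look something like this: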

from bs4 import BeautifulSoup

# html_doc is the HTML string built in the question
soup = BeautifulSoup(html_doc, "html.parser")

for t in soup.find_all("table"):  # the actual selection depends on your specific code
    content = t.get_text(strip=True)
    # content should be the float number, as a string
