I have some HTML tables inside of an HTML cell, like so:

miniTable = '''<table style="width: 100%%" bgcolor="%s">
               <tr><td><font color="%s"><b>%s</b></td></tr>
           </table>''' % (bgcolor, fontColor, floatNumber)

html += '<td>' + miniTable + '</td>'

Is there a way to remove the HTML tags that pertain to this minitable, and only these html tags?
I would like to somehow remove these tags:

<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>
and
</b></td></tr></table>

to get this:

floatNumber

where floatNumber is the string representation of a floating point number. I don't want any of the other HTML tags to be modified in any way. I was thinking of using string.replace or regex, but I'm stumped.


2 Answers


If you can't install and use Beautiful Soup (otherwise BS is preferred, as @otto-allmendinger suggests):

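import re

# The mini table with a sample number filled in
s = '<table style="width: 100%%" bgcolor="%s"><tr><td><font color="%s"><b>1.23</b></td></tr></table>'
# Strip the opening/closing table, tr, td, and b tags plus the opening font tag
result = float(re.sub(r"</?table[^>]*>|</?t[rd]>|<font[^>]+>|</?b>", "", s))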
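
Note that the substitution above strips every table, tr, td, and b tag in the string it is given, so it is only safe on the mini table by itself. If the mini table is already embedded in a larger document, one possible approach is to match the whole generated wrapper and keep only the number. The following is a minimal sketch, assuming the markup is exactly what the template in the question produces; the names MINI_TABLE and unwrap_mini_tables and the sample colors are made up for illustration:

```python
import re

# Matches one mini table as generated by the template in the question.
# Attribute values are matched loosely; group 1 captures the text inside <b>.
# Assumption: the markup was produced by that exact template.
MINI_TABLE = re.compile(
    r'<table style="width: 100%" bgcolor="[^"]*">\s*'
    r'<tr><td><font color="[^"]*"><b>([^<]*)</b></td></tr>\s*'
    r'</table>'
)


def unwrap_mini_tables(html):
    """Replace each mini table with the text it wraps; other tags are untouched."""
    return MINI_TABLE.sub(r"\1", html)


# Build a sample document the same way the question does (hypothetical values).
mini_table = '''<table style="width: 100%%" bgcolor="%s">
               <tr><td><font color="%s"><b>%s</b></td></tr>
           </table>''' % ("#ffffff", "#000000", 1.23)
html = "<table><tr><td>" + mini_table + "</td></tr></table>"

print(unwrap_mini_tables(html))
```

The outer table is left alone because its opening tag has no style attribute and therefore does not match the pattern. If the generated markup ever changes, the pattern has to change with it, which is the usual argument for a parser.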
answered 2012-07-13T14:43:20.117

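Don't use str.replace or regular expressions.

Use an HTML parsing library such as Beautiful Soup to get the element you want and the text it contains.

The final code should look something like this: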

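from bs4 import BeautifulSoup

# html_doc is the full HTML string; naming the parser explicitly avoids a bs4 warning
soup = BeautifulSoup(html_doc, "html.parser")

for t in soup.find_all("table"):  # the actual selection depends on your specific code
    content = t.get_text(strip=True)
    # content should be the float number as a string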
answered 2012-07-13T14:40:06.247