
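I am trying to "defrontpagify" the HTML of an MS FrontPage-generated website, and I'm writing a BeautifulSoup script to do it.

However, I've got stuck on the part where I try to strip a particular attribute (or list of attributes) from every tag in the document that contains them. The code snippet: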

REMOVE_ATTRIBUTES = ['lang','language','onmouseover','onmouseout','script','style','font',
                        'dir','face','size','color','style','class','width','height','hspace',
                        'border','valign','align','background','bgcolor','text','link','vlink',
                        'alink','cellpadding','cellspacing']

# remove all attributes in REMOVE_ATTRIBUTES from all tags, 
# but preserve the tag and its content. 
for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.findAll(attribute=True):
        del(tag[attribute])

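It runs without error, but doesn't actually strip any of the attributes. When I run it without the outer loop, just hard-coding a single attribute (soup.findAll('style'=True)), it works.

Does anyone know what the problem is here?

PS - I don't much like the nested loops either. If anyone knows a more functional, map/filter-ish style, I'd love to see it.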


5 Answers


The line

for tag in soup.findAll(attribute=True):

isn't finding any tags. There might be a way to do this with findAll; I'm not sure. However, this works:

import BeautifulSoup
REMOVE_ATTRIBUTES = [
    'lang','language','onmouseover','onmouseout','script','style','font',
    'dir','face','size','color','style','class','width','height','hspace',
    'border','valign','align','background','bgcolor','text','link','vlink',
    'alink','cellpadding','cellspacing']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''
soup = BeautifulSoup.BeautifulSoup(doc)
for tag in soup.recursiveChildGenerator():
    try:
        tag.attrs = [(key,value) for key,value in tag.attrs
                     if key not in REMOVE_ATTRIBUTES]
    except AttributeError: 
        # 'NavigableString' object has no attribute 'attrs'
        pass
print(soup.prettify())

Note that this code only works in Python 3. If you need it to work in Python 2, see Nóra's answer below.
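For anyone on the newer bs4 package, here is a minimal sketch of the same idea. It assumes bs4 is installed and that tag.attrs is a dict there (rather than the list of pairs used above); the helper name strip_attributes and the shortened attribute set are made up for illustration.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative subset of the attribute list from the question.
REMOVE_ATTRIBUTES = {'align', 'onmouseout', 'style', 'class'}

def strip_attributes(soup, unwanted):
    """Drop every attribute named in `unwanted` from every tag, in place."""
    # find_all(True) matches every tag and skips text nodes,
    # so no try/except around NavigableString is needed.
    for tag in soup.find_all(True):
        tag.attrs = {key: value for key, value in tag.attrs.items()
                     if key not in unwanted}
    return soup

doc = '<p id="firstpara" align="center">This is <a onmouseout="">one</a>.</p>'
soup = strip_attributes(BeautifulSoup(doc, 'html.parser'), REMOVE_ATTRIBUTES)
print(soup)
```

The tags and their contents are left alone; only the listed attributes go away, and id survives because it is not in the set.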

answered 2012-01-28T13:48:57.263

Just ftr: the problem here is that if you pass HTML attributes as keyword arguments, the keyword is the name of the attribute. So your code is searching for tags with an attribute of name attribute, as the variable does not get expanded.

This is why

  1. hard-coding your attribute name worked[0]
  2. the code does not fail. The search just doesn't match any tags

To fix the problem, pass the attribute you are looking for as a dict:

for attribute in REMOVE_ATTRIBUTES:
    for tag in soup.find_all(attrs={attribute: True}):
        del tag[attribute]

Hth someone in the future, dtk

[0]: Although it needs to be find_all(style=True) in your example, without the quotes, because SyntaxError: keyword can't be an expression
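The keyword-name behaviour described above is plain Python and can be seen without BeautifulSoup at all. A small sketch, where show_filters is a hypothetical stand-in that just returns whatever keyword filters it was given (it is not part of any library):

```python
def show_filters(**kwargs):
    """Hypothetical stand-in for find_all: return the keyword filters received."""
    return kwargs

attribute = 'style'

# The keyword is taken literally; the variable is never looked up.
literal = show_filters(attribute=True)

# Building a dict uses the variable's value as the key.
expanded = show_filters(**{attribute: True})

print(literal)   # the filter is named 'attribute'
print(expanded)  # the filter is named 'style'
```

Passing attrs={attribute: True}, as in the fix above, relies on the same thing: the dict key is evaluated, the keyword name is not.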

answered 2018-07-13T12:04:49.330

Here is a Python 2 version of unutbu's answer:

# Assumption: this targets the bs4 package, where tag.attrs is a dict
# (iteritems() would fail on the list of pairs used by BeautifulSoup 3).
from bs4 import BeautifulSoup

REMOVE_ATTRIBUTES = ['lang','language','onmouseover']

doc = '''<html><head><title>Page title</title></head><body><p id="firstpara" align="center">This is <i>paragraph</i> <a onmouseout="">one</a>.<p id="secondpara" align="blah">This is <i>paragraph</i> <b>two</b>.</html>'''

soup = BeautifulSoup(doc, 'html.parser')

for tag in soup.recursiveChildGenerator():
    # Text nodes (NavigableString) have no attrs, so skip them.
    if hasattr(tag, 'attrs'):
        tag.attrs = {key:value for key,value in tag.attrs.iteritems()
                    if key not in REMOVE_ATTRIBUTES}

answered 2016-10-11T11:16:02.087

I use this approach to remove a list of attributes; it is quite compact:

attributes_to_del = ["style", "border", "rowspan", "colspan", "width", "height", 
                     "align", "valign", "color", "bgcolor", "cellspacing", 
                     "cellpadding", "onclick", "alt", "title"]
for attr_del in attributes_to_del: 
    [s.attrs.pop(attr_del) for s in soup.find_all() if attr_del in s.attrs]
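The list comprehension above is used only for its side effects and walks the whole tree once per attribute. A possible alternative, sketched here assuming bs4 and a shortened attribute list, is a plain loop that visits each tag once and uses pop with a default, so no membership check is needed:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative subset of the list above.
attributes_to_del = ["style", "border", "width"]

html = ('<table border="1" width="100"><tr>'
        '<td style="color:red" id="cell">x</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

for tag in soup.find_all(True):        # every tag, visited once
    for attr_del in attributes_to_del:
        # The None default means a missing attribute is silently ignored.
        tag.attrs.pop(attr_del, None)

print(soup)
```

Whether this is faster in practice depends on the document; the main gain is that it reads as a loop rather than a throwaway list.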


answered 2020-05-16T17:58:46.570

I use this:

if "align" in div.attrs:
    del div.attrs["align"]

or

if "align" in div.attrs:
    div.attrs.pop("align")

Thanks to https://stackoverflow.com/a/22497855/1907997

answered 2018-11-16T15:03:16.887