python - Using Beautiful Soup to convert CSS attributes to individual HTML attributes?

Question

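I am trying to write a program that will take an HTML file and make it more email friendly. Right now all the conversion is done manually, because none of the online converters do exactly what we need.

This sounded like a great opportunity to push the limits of my programming knowledge and actually code something useful, so I offered to try writing a program in my spare time to help automate the process.

I don't know much about HTML or CSS, so I'm mostly relying on my brother (who does know HTML and CSS) to describe what changes this program needs to make. Please bear with me if I ask a stupid question; this is totally new territory for me.

Most of the changes are pretty basic: if you see tag/attribute X, convert it to tag/attribute Y. But I've run into trouble when dealing with an HTML tag that contains a style attribute. For example: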

<img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />

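Whenever possible, I want to convert the style attribute into HTML attributes (or convert the style attribute into something more email friendly). So after the conversion it should look like this: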

<img src="http://example.com/file.jpg" width="150" height="50" align="right"/>

I realize that not all CSS properties have an HTML equivalent, so for now I only want to focus on the ones that do. I wrote a Python script to do this conversion:

from bs4 import BeautifulSoup
import re

class Styler(object):

    # CSS property names whose HTML attribute has a different name
    img_attributes = {'float' : 'align'}

    def __init__(self, soup):
        self.soup = soup

    def format_factory(self):
        self.handle_image()

    def handle_image(self):
        # Every <img> tag that has a non-empty style attribute
        tags = self.soup.find_all("img", style = re.compile('.'))
        print(tags)
        for tag in tags:
            old_attributes = tag['style']
            # "width:150px;height:50px" -> ['width', '150', 'height', '50']
            tokens = [s for s in re.split(r'[:;]+|px', str(old_attributes)) if s]
            del tag['style']
            print(tokens)

            # Tokens alternate between a property name and its value
            for j in range(0, len(tokens), 2):
                if tokens[j] in Styler.img_attributes:
                    tokens[j] = Styler.img_attributes[tokens[j]]

                tag[tokens[j]] = tokens[j+1]

if __name__ == '__main__':
    html = """
    <body>hello</body>
    <img src="http://example.com/file.jpg" style="width:150px;height:50px;float:right" />
    <blockquote>my blockquote text</blockquote>
    <div style="padding-left:25px; padding-right:25px;">text here</div>
    <body>goodbye</body>
    """
    # Name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(html, "html.parser")
    s = Styler(soup)
    s.format_factory()

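This script works just fine for my particular example, but it's not very robust, and I realize it could easily break when pitted against real-world input. My question is: how can I make it more robust? As far as I can tell, Beautiful Soup doesn't have a way to change or extract individual pieces of a style attribute, and I guess that's what I'm looking to do.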

score 11 · Accepted Answer

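For this type of thing, I'd recommend combining an HTML parser (like BeautifulSoup or lxml) with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than if you try to come up with regular expressions that match any possible CSS you might find in the wild.

For example: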

>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'

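Using this, you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is designed more for formatting things as standalone CSS files, but it's flexible enough to mostly work for what you're doing here.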

score 2 · Accepted Answer


Instead of reinventing the wheel, use the stoneage package: http://pypi.python.org/pypi/StoneageHTML

Answered 2012-05-16T09:21:02.870
