python - Extract all HTML attrs using a regular expression

Question

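I want to use the re module to extract all HTML nodes from a string, including all of their attributes. However, I want each attr to be its own group, so that I can retrieve it with matchobj.group(). The number of attributes in a node varies, and that is where I am stuck: I don't know how to write such a regex. I have tried </?(\w+)(\s\w+[^>]*?)*/?> but for a node like <a href='aaa' style='bbb'> I only get two groups: [('a'), ('style="bbb"')].
I know there are some good HTML parsers, but I am not actually trying to extract the values of the attrs. I need to modify the raw string.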

score 3 · Accepted Answer

Please don't use a regular expression for this. Use BeautifulSoup:

>>> from bs4 import BeautifulSoup as BS
>>> html = """<a href='aaa' style='bbb'>"""
>>> soup = BS(html, 'html.parser')  # name the parser explicitly to avoid a warning
>>> mytag = soup.find('a')
>>> print(mytag['href'])
aaa
>>> print(mytag['style'])
bbb

Or, if you want a dictionary:

>>> print(mytag.attrs)
{'href': 'aaa', 'style': 'bbb'}
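
If installing BeautifulSoup is not an option, a similar view of the attributes can be had from the standard library's html.parser module. This is a swapped-in technique, not what the answer above uses. The sketch below is only an illustration: the class name AttrCollector and the sample markup are made up for this example. It records each start tag's name, its attributes as a dictionary, and the raw tag text, which may help when the original string has to be edited afterwards.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: collects (tag, attrs, raw text) for every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs;
        # get_starttag_text() returns the tag exactly as it appeared in the input.
        self.tags.append((tag, dict(attrs), self.get_starttag_text()))


collector = AttrCollector()
collector.feed("<a href='aaa' style='bbb'>link</a>")
collector.close()

for tag, attrs, raw in collector.tags:
    print(tag, attrs, raw)
```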

score 1 · Accepted Answer

Description

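Capturing an unbounded number of attributes takes two steps. First, match each element as a whole. Then iterate over the matched element and collect every attribute it contains.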
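
The reason a single pattern is not enough is that, in Python's re module, a capture group placed inside a repetition keeps only the text from its last iteration. A minimal sketch of that behaviour, using the pattern and sample tag from the question (the variable names are just for illustration):

```python
import re

# The pattern from the question, applied to the question's sample tag.
pattern = re.compile(r"</?(\w+)(\s\w+[^>]*?)*/?>")
match = pattern.match("<a href='aaa' style='bbb'>")

# Group 2 sits inside a repetition, so each pass overwrites the previous capture.
groups = match.groups()
print(groups)
```

However many attributes the tag has, the match exposes exactly two groups, and the second holds only the final attribute. That is why the attributes are collected in a second pass.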

Regex to match every element: <\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>

Regex to match every attribute within a single element: \s\w+=(?:'[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)

Python example

See a working example: http://repl.it/J0t/4

Code

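import re

# Sample markup: a double-quoted value, a single-quoted value containing '>', and an unquoted value.
string = """
<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>
"""

# Step 1: match each opening tag in full, skipping over quoted attribute values.
for matchElementObj in re.finditer(r'<\w+(?=\s|>)(?:[^>=]|=\'[^\']*\'|="[^"]*"|=[^\'"][^\s>]*)*?>', string, re.M|re.I|re.S):
    print("-------")
    print("matchElementObj.group(0) : ", matchElementObj.group(0))

    # Step 2: scan only the matched element (not the whole input) for its attributes.
    for matchAttributesObj in re.finditer(r'\s\w+=(?:\'[^\']*\'|"[^"]*"|[^\'"][^\s>]*)(?=\s|>)', matchElementObj.group(0), re.M|re.I|re.S):
        print("matchAttributesObj.group(0) : ", matchAttributesObj.group(0))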

Output

-------
matchElementObj.group(0) :  <a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>
matchAttributesObj.group(0) :   href="i.like.kittens.com"
matchAttributesObj.group(0) :   NotRealAttribute=' true="4>2"'
matchAttributesObj.group(0) :   class=Fonzie
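
Since the question asks for a way to modify the raw string, the same two patterns can be fed to re.sub with a callback in place of re.finditer. The following is a rough sketch, not part of the original answer. The helper names rewrite_attributes and lowercase_name are hypothetical, and capture groups for the leading whitespace, the name, and the value have been added to the attribute pattern as an assumption so the callback can see each part.

```python
import re

# Element pattern from the answer above, unchanged.
ELEMENT_RE = re.compile(
    r'''<\w+(?=\s|>)(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*?>''')

# Attribute pattern from the answer, with three capture groups added (assumption):
# 1 = leading whitespace, 2 = attribute name, 3 = value including any quotes.
ATTR_RE = re.compile(
    r'''(\s)(\w+)=('[^']*'|"[^"]*"|[^'"][^\s>]*)(?=\s|>)''')


def rewrite_attributes(html, transform):
    """Hypothetical helper: apply transform(name, value) to every attribute."""
    def fix_element(element_match):
        # Rewrite attributes only inside the matched opening tag.
        return ATTR_RE.sub(
            lambda m: m.group(1) + transform(m.group(2), m.group(3)),
            element_match.group(0))
    # Text outside opening tags is left untouched.
    return ELEMENT_RE.sub(fix_element, html)


def lowercase_name(name, value):
    # Example transform: lowercase the attribute name, keep the value as written.
    return name.lower() + "=" + value


html = '''<a href="i.like.kittens.com" NotRealAttribute=' true="4>2"' class=Fonzie>text</a>'''
result = rewrite_attributes(html, lowercase_name)
print(result)
```

Only the text of each opening tag passes through the callback, so everything else in the input, including quoting style and spacing, stays as it was.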
