python - Request for a robust regex to check whether an img tag in an HTML document contains the alt attribute

Question

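I am writing a Python script to check the IMG tags of an HTML document. It should check whether alt="" is present inside each IMG tag, and then print out the line number.

The regex must account for the attributes appearing in different orders. For example: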

<img class="" alt="" src="">
<img class="" src="">
<img src="" class="">
<img src="">

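So, to summarize: the regex should check that all the attributes of the img tag are present, and it must handle any possible ordering of them.

Thanks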

score 2 · Accepted Answer

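Using regular expressions to evaluate HTML is somewhat risky, but if you are willing to accept the drawbacks*, you can make it work with positive lookahead assertions: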

import re

regex = re.compile(r'<img (?=[^>]*\balt=")(?=[^>]*\bsrc=")(?=[^>]*\bclass=")')

This matches if the current string contains <img followed (within the same tag) by alt=", src=" and class=", in any order.

Explanation:

<img    # Match '<img'
(?=     # Assert that it's possible to match the following from this position:
 [^>]*  #  Any number of characters except >
 \b     #  A word boundary (here: start of a word)
 alt="  #  The literal text 'alt="'
)       # End of lookahead
(?=[^>]*\bsrc=")   # Do the same for `src`, from the same position as before
(?=[^>]*\bclass=") # Do the same for `class`, from the same position as before
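Since the question also asks for line numbers, here is a minimal sketch of how the lookahead idea might be applied to report them. It is a variant of the pattern above, not the original regex: it assumes a negative lookahead to find img tags that lack alt=, and the names IMG_WITHOUT_ALT and lines_missing_alt are hypothetical.

```python
import re

# Assumed variant of the answer's pattern: a negative lookahead that
# matches '<img' only when no 'alt=' appears before the closing '>'.
IMG_WITHOUT_ALT = re.compile(r'<img\b(?![^>]*\balt=)', re.IGNORECASE)

def lines_missing_alt(html):
    """Return the 1-based line numbers of <img> tags without an alt attribute."""
    return [
        # Count the newlines before the match to get its line number.
        html.count("\n", 0, m.start()) + 1
        for m in IMG_WITHOUT_ALT.finditer(html)
    ]

sample = '''<img class="" alt="" src="">
<img class="" src="">
<img src="" class="">
<img src="">
'''

print(lines_missing_alt(sample))  # [2, 3, 4]
```

Counting newlines up to the match start keeps this working when a tag spans several lines, which a line-by-line scan would miss.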

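*Of course, this regex is entirely unaware of whether the tag it matches is inside a comment, interrupted by a comment, malformed, surrounded by <pre> tags, or in any other situation that could change its meaning to an actual HTML parser.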
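If those drawbacks matter, one alternative (a different technique from the regex in this answer) is the standard library's html.parser module, which skips comments and tracks positions itself. The sketch below is illustrative only; the class name ImgAltChecker and the function find_missing_alt are hypothetical.

```python
from html.parser import HTMLParser

class ImgAltChecker(HTMLParser):
    """Collects the line numbers of <img> tags that have no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag == "img" and "alt" not in dict(attrs):
            line, _column = self.getpos()  # line numbers are 1-based
            self.missing.append(line)

def find_missing_alt(html):
    checker = ImgAltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing

doc = '<!-- <img src=""> -->\n<img src="">\n<img src="" alt="">\n'
print(find_missing_alt(doc))  # [2]
```

The img inside the comment on line 1 is not reported, because the parser treats it as comment text rather than as a tag.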
