python - Is HtmlParser.entityref actually a valid regular expression for matching HTML entity references?

Question

This is code from Python 2.7's HtmlParser:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

Previously, I assumed it looked more like this:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

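So I was surprised by the result on some odd data from an odd source.

My use case doesn't matter; is there any reason to define an entity reference the way HtmlParser does?

Unrelated use case: in case anyone is wondering, I'll describe my use case anyway. Note that I am no longer trying to solve it. My question is whether HtmlParser's entityref is broken.

My use case is similar to: Strip HTML from strings in Python

The input data I am talking about looks like this: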

r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

The expected output for my use case is r"""a&Il_'d@m_'""".

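Edit: I tried to compare the regular expression against this SGML reference, and as far as I can tell an entity reference should end with ;. I am not very familiar with the topic, though, so I thought I would ask.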

score 2 · Accepted Answer

The syntactic production for reference end reads:

[61] reference end = ( refc | RE )?

where refc is ; in the reference concrete syntax and RE is the record end, character 13 (CR).

That means that the following are recognized as reference ends:

A REFerence Close delimiter (; in the reference syntax), as you expected
A Record End
Nothing (note the use of the ? metacharacter after the close parenthesis, meaning that both REFC and RE are optional)
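As a rough illustration of the three cases above, the sketch below runs the pattern quoted in the question against a name followed by `;`, by a carriage return, and by a plain space. Using `\r` to stand in for RE is an assumption about how a record end shows up in a Python string, and the `cases` inputs and variable names are made up for this example.

```python
import re

# Pattern quoted in the question (Python 2.7 HTMLParser).
entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

# Hypothetical inputs, one per kind of reference end.
cases = {
    'refc': 'x &amp; y',         # closed by ';'
    'record end': 'x &amp\ry',   # closed by CR (assumed stand-in for RE)
    'nothing': 'x &amp y',       # name simply stops at a non-name character
}

results = {}
for label, text in cases.items():
    m = entityref.search(text)
    # group(1) is the entity name; group(0) also includes the terminator.
    results[label] = (m.group(1), m.group(0))
    print(label, results[label])

# The pattern needs one character after the name, so a name at the very
# end of the string is not matched by this regex alone.
at_end = entityref.search('x &amp')
print(at_end)
```

In all three cases the captured name is `amp`; only the consumed terminator differs.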

If nothing is used as a reference end, the reference ends at the first non-name character after the name start character, as required by the rules of the reference recognition mode that has been entered at the Entity Reference Open delimiter (ERO &).
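To see how this plays out on the data from the question, here is a minimal sketch comparing the two patterns. The names `entityref_lenient` and `entityref_strict` are chosen for this example only; the patterns and the sample string are copied from the question.

```python
import re

# Pattern quoted in the question from Python 2.7's HTMLParser.
entityref_lenient = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
# Stricter variant the asker expected: the name must be followed by ';'.
entityref_strict = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

sample = r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

lenient = entityref_lenient.search(sample)
strict = entityref_strict.search(sample)

# '_' is not a name character in this pattern, so the name ends at 'Il'.
print(lenient.group(0))  # &Il_
print(lenient.group(1))  # Il
print(strict)            # None
```

The lenient pattern treats `&Il` as a reference to an entity named `Il`, which is consistent with the "nothing as reference end" case, while the `;`-only pattern finds no reference in this input.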

Note also that ERO is only used for the general entity reference production.
