This is code from Python 2.7's HTMLParser:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')

Until now, I assumed it would look more like this:

entityref = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

So I was surprised by what it did with some odd data from some odd sources.

My use case doesn't matter; is there any reason to define an entity reference the way HTMLParser does?


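Unrelated use case: in case anyone is wondering, I'll describe my use case anyway. Note that I'm no longer trying to solve it. My question is whether HTMLParser's entityref is wrong.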

My use case is similar to: Strip HTML from strings in Python

The input data I'm talking about looks like this:

r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

The expected output for my use case is r"""a&Il_'d@m_'""".
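
To make the difference concrete, here is a minimal sketch of how the two patterns behave on that input. The names lenient, strict and sample are only for illustration and do not come from HTMLParser itself.

```python
import re

# Pattern quoted above from Python 2.7's HTMLParser.
lenient = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*)[^a-zA-Z0-9]')
# Pattern I expected: the reference must be closed by ';'.
strict = re.compile('&([a-zA-Z][-.a-zA-Z0-9]*);')

sample = r'''<foo bar="blah"> asda&Il_'d@m_'<foo rab="halb">'''

# The lenient pattern accepts any non-alphanumeric character after the name,
# so it treats "&Il" followed by "_" as an entity reference.
lenient_match = lenient.search(sample)
# The strict pattern finds nothing, because no ';' follows the name.
strict_match = strict.search(sample)

print(lenient_match.group(0), lenient_match.group(1), strict_match)
```

So the lenient pattern consumes "&Il_" as a reference to an entity named "Il", which is where my surprising results come from.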


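Edit: I tried to compare the regex against this SGML reference, and as far as I can tell an entity reference should end with ;, but I'm not very familiar with the topic, so I thought I'd ask.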

1 Answer

The syntactic production for reference end reads:

[61] reference end = ( refc | RE ) ?

  (in the reference concrete syntax, refc is ; and RE is character (13), CR)

That means that the following are recognized as reference ends:

  • A REFerence Close delimiter (; in the reference syntax), as you expected
  • A Record End
  • Nothing (note the use of the ? metacharacter after the close parenthesis, meaning that both REFC and RE are optional)

If nothing is used as a reference end, the reference ends at the first non-name character after the name start character, as required by the rules of the reference recognition mode that has been entered at the Entity Reference Open delimiter (ERO &).

Note also that ERO is only used for the general entity reference production.
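
As a rough illustration of the production above, the three possible reference ends can be sketched with a Python regex. This is only a sketch under assumptions: the pattern and the reference_end helper are hypothetical, the name characters are borrowed from the regex in the question, and none of it is taken from HTMLParser or from the standard.

```python
import re

# Hypothetical pattern modelling production [61]: after the name comes
# refc (';'), RE (carriage return, character 13), or nothing at all.
reference = re.compile(r'&([a-zA-Z][-.a-zA-Z0-9]*)(;|\r)?')

def reference_end(text):
    """Return (name, reference end) for the first reference, or None."""
    m = reference.search(text)
    if m is None:
        return None
    # group(2) is None when the reference simply stops at the first
    # non-name character, i.e. the "nothing" case.
    return m.group(1), m.group(2) or ''

print(reference_end('a &amp; b'))     # closed by refc
print(reference_end('a &amp\rb'))     # closed by RE
print(reference_end("asda&Il_'d"))    # closed by nothing
```

In the last call the name ends at the underscore, which is the "nothing" case and is also the behaviour the HTMLParser pattern approximates by requiring a non-alphanumeric character after the name.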

answered 2014-11-27T23:14:11.867