regex - Regex pattern to find an & that occurs in text but not in entities?

Question

I need to match an & that appears in plain text, but the pattern must not capture the & that begins an entity.

For example,

hi this is a plain text containing & and the entity &#x45; , &#65286; and &amp;

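In the text above, I should find only the & that is part of the plain text, i.e. the one after containing. I tried the pattern &[^#x]*, but I could not get all the matches.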

score 4 · Accepted Answer

A regex for matching HTML entities, borrowed from another answer and combined with a negative lookahead:

&(?!(amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|
     \#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+);)
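The pattern above is split over two lines with indentation, so it only works as written in a flavor that has an extended (free-spacing) mode; that is also why the # is escaped. A minimal sketch in Python, assuming the `re` module with `re.VERBOSE`; the names `BARE_AMP` and `sample` and the sample text are made up for illustration:

```python
import re

# The two-line pattern from the answer, compiled in verbose mode.
# re.VERBOSE ignores the line break and indentation inside the pattern;
# the '#' has to stay escaped, otherwise it would start a comment.
BARE_AMP = re.compile(r"""
    &(?!(amp|apos|gt|lt|nbsp|quot|bull|hellip|[lr][ds]quo|[mn]dash|permil|
         \#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+);)
""", re.VERBOSE)

sample = "a & b &amp; c &#123; d"   # made-up sample text
positions = [m.start() for m in BARE_AMP.finditer(sample)]
print(positions)  # [2]
```

Only the bare & at index 2 is reported; &amp; and &#123; are rejected by the lookahead.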

Shortened:

&(?!(\#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+);)

Explanation:

We want to match & but not &123; and so on.

&                 // match an ampersand
(                 // group starts
    ?!            // negative look-ahead (don't match '&' if this group matches)
    (\#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+); // regex to match HTML entity after '&'
)                 // group ends
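As a rough check, the shortened pattern can be run against the text from the question. In this sketch (Python `re` assumed) it still reports the & of &#x45; and &#65286;, because the numeric branch only accepts two to four decimal digits and has no hexadecimal form. The `wider` pattern is a hypothetical variant, not part of the answer, that accepts any run of decimal digits and the `#x` form:

```python
import re

text = ("hi this is a plain text containing & and the entity "
        "&#x45; , &#65286; and &amp;")

# Shortened pattern from the answer, unchanged.
short = re.compile(r"&(?!(\#[1-9]\d{1,3}|[A-Za-z][0-9A-Za-z]+);)")

# Hypothetical variant (not from the answer): any run of decimal digits,
# plus hexadecimal references written as &#x...; or &#X...;
wider = re.compile(
    r"&(?!(?:\#[0-9]+|\#[xX][0-9A-Fa-f]+|[A-Za-z][0-9A-Za-z]+);)"
)

short_hits = [m.start() for m in short.finditer(text)]
wider_hits = [m.start() for m in wider.finditer(text)]
print(short_hits)  # [35, 52, 61]: also flags &#x45; and &#65286;
print(wider_hits)  # [35]: only the bare ampersand after "containing"
```

Because the lookahead consumes nothing, each match is the single & character, which makes the result easy to feed into a replacement such as `wider.sub("&amp;", text)`.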

score 0 · Answer

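With [^#x] you match any single character that is neither '#' nor 'x'. What you probably want is &[^#][^x]. If an '&' may occur at the end of the string, or the string may be shorter than 3 characters, you have to handle those cases separately.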
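To see the difference on the text from the question, here is a small sketch, assuming Python's `re` module (the variable names are made up). It also shows that the fixed-width form consumes the two characters after the & and still catches &amp;:

```python
import re

text = ("hi this is a plain text containing & and the entity "
        "&#x45; , &#65286; and &amp;")

# Pattern from the question: '&' followed by any run of characters
# that are neither '#' nor 'x'.
greedy = re.findall(r"&[^#x]*", text)

# Pattern suggested above: '&', one character that is not '#',
# then one character that is not 'x'.
fixed_width = re.findall(r"&[^#][^x]", text)

print(greedy)       # ['& and the entity &', '&', '&amp;']
print(fixed_width)  # ['& a', '&am']
```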

PS: what needs escaping depends on the regex flavor you are actually using.

Edit

For &amp (and all other HTML entities, e.g. ! = &excl;), you can simply provide alternatives, such as &([^#][^x])|([^a][^m][^p])|([^e][^x][^c][^l])

If your regex flavor supports lookahead assertions, it is easier to use &(?!(#x|amp|excl)) and so on.
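A short sketch of the lookahead form, assuming Python's `re` module. The alternation of negated classes is hard to get right (the | there applies to the whole expression, so the later branches are not tied to the &), so only the lookahead is shown. The `longer` pattern is a hypothetical extension of the list, added here to cover decimal references:

```python
import re

text = ("hi this is a plain text containing & and the entity "
        "&#x45; , &#65286; and &amp;")

# Lookahead from the answer, unchanged:
# '&' not followed by '#x', 'amp' or 'excl'.
basic = re.compile(r"&(?!(#x|amp|excl))")

# Hypothetical longer list (not from the answer): also rule out
# decimal references such as &#65286; by rejecting '#' plus a digit.
longer = re.compile(r"&(?!(#x|#\d|amp|excl))")

basic_hits = [m.start() for m in basic.finditer(text)]
longer_hits = [m.start() for m in longer.finditer(text)]
print(basic_hits)   # [35, 61]
print(longer_hits)  # [35]
```

Neither pattern requires the closing ;, so plain text such as "&amplifier" would also be skipped; whether that matters depends on the input.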
