php - Cutting off a string at disallowed tags in a regular expression

Question

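I have this working regular expression, used with PHP's preg_match_all, that matches a string containing 0 to x sentences before and 0 to y sentences after a specific word within a sentence/string: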

'(?:[^\.?!<]*[\.?!]+){0,x}(?:[^\.?!]*)'.$word.'(?:[^\.?!]*)(?:[\.?!]+[^\.?!]*){0,y}'.'(?:[\.?!]+)'

Now I would like the string to be cut off wherever certain tags occur. So I was thinking of working this part into the pattern above:

(?:(<\/?(?!'.$allowed_tags.')))

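where $allowed_tags is a PHP variable that might look like this: '(frame|head|span|script)'

Despite trying to get it to work with lookaheads, lookbehinds and other conditions, I can't get it to behave properly, and unfortunately I have to admit that this is beyond my programming skills.

I hope someone can help. I'm sure one of you geniuses can :)

Thanks a lot in advance!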

Example input and output:

For example, I would like to grab this part:

<p>Tradition, Expansion, Exile.<br/>Individual paths in Chinese contemporary art </p><p>The contemporary <i>art world</i> craves for novelty: the best reason for Chinese art to be so trendy is also the <strong>worst one</strong>.</p>

from this full string:

<div readability="120"><p>Tradition, Expansion, Exile.<br/>Individual paths in Chinese contemporary art </p><p>The contemporary <i>art world</i> craves for novelty: the best reason for Chinese art to be so trendy is also the <strong>worst one</strong>.</p><div>

This means that in this example <p></p>, <i></i>, <strong></strong> and <br/> are allowed tags and <div></div> is not.

score 1 · Accepted Answer

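Assuming that, per your comment, you define div and span tags as "illegal", the following regex will match x sentences before and y sentences after the sentence containing $word, as long as those sentences do not contain "illegal" tags: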

'(?:(?<=[.!?]|^)(?:(?<!<div|<\/div|<span|<\/span)>|[^>.!?])+[.!?]+){0,x}[^.!?]*'.$word.'[^.!?]*[.!?]+(?:(?:<(?!\/?div|\/?span)|[^<.!?])*[.!?]+){0,y}'

Broken down and explained (quotes and string concatenation operators removed, comments and line breaks added for readability):

                                     // 0 TO X LEADING SENTENCES
(?: ---------------------------------// do not create a capture group
  (?<=[.!?]|^) ----------------------// match only after sentence end or start of string
  (?: -------------------------------// do not create a capture group
    (?<!<div|<\/div|<span|<\/span)> -// match ">" only if not preceded by a div or span tag
    |[^>.!?] ------------------------// or any other non-punctuation character
  )+ --------------------------------// one or more times
  [.!?]+ ----------------------------// followed by one or more punctuation characters
){0,x} ------------------------------// the whole sentence repeated 0 to x times
                                     // MIDDLE SENTENCE WITH KEYWORD
[^.!?]* -----------------------------// match 0 or more non-punctuation characters
$word -------------------------------// match string value of $word
[^.!?]* -----------------------------// match 0 or more non-punctuation characters
[.!?]+ ------------------------------// followed by one or more punctuation characters
                                     // 0 TO Y TRAILING SENTENCES
(?: ---------------------------------// do not create a capture group
  (?: -------------------------------// do not create a capture group
    <(?!\/?div|\/?span) -------------// match "<" not followed by a "div" or "span" tag
    |[^<.!?] ------------------------// or any non-punctuation character that is not "<"
  )* --------------------------------// zero or more times
  [.!?]+ ----------------------------// followed by one or more punctuation characters
){0,y} ------------------------------// the whole sentence repeated 0 to y times
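
To make the breakdown concrete, here is a minimal sketch of how the pattern could be assembled in PHP. The function name buildSentencePattern is hypothetical, x and y are passed as integers in place of the placeholders, and the preg_quote call is an addition that the original pattern does not have.

```php
<?php
// Hypothetical helper: assembles the pattern above for a keyword and sentence counts.
function buildSentencePattern(string $word, int $x, int $y): string
{
    // Assumption: the keyword is literal text, so escape any regex metacharacters.
    $quoted = preg_quote($word, '/');

    return '/'
        // 0 to x leading sentences that contain no complete div/span tag
        . '(?:(?<=[.!?]|^)(?:(?<!<div|<\/div|<span|<\/span)>|[^>.!?])+[.!?]+){0,' . $x . '}'
        // the sentence containing the keyword
        . '[^.!?]*' . $quoted . '[^.!?]*[.!?]+'
        // 0 to y trailing sentences, stopping before a div/span tag
        . '(?:(?:<(?!\/?div|\/?span)|[^<.!?])*[.!?]+){0,' . $y . '}'
        . '/';
}

$text = 'Alpha one. Beta two. Gamma KEY three. Delta four. Epsilon five.';
preg_match(buildSentencePattern('KEY', 1, 1), $text, $m);
// $m[0] is " Beta two. Gamma KEY three. Delta four."
```

With x = 1 and y = 1 the match covers one sentence on each side of the keyword sentence. If the input were 'Gamma KEY three. Delta <div>four. Epsilon five.', the trailing part would stop at the div tag and only 'Gamma KEY three.' would be matched.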

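Note that the lookbehind assertion used to match the sentences preceding $word only matches opening and closing tags without attributes, and the opening and closing variants have to be spelled out literally, because lookbehind assertions cannot be variable-length. There are other limitations and pitfalls: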

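Most notably, the regex does not catch "illegal" tags that sit inside the sentence containing $word,
and "inside" a sentence literally means "after the closing punctuation of the preceding sentence", which, while formally correct, may not be what is expected.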
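
The attribute limitation of the lookbehind can be seen in isolation. The snippet below is only an illustration. The variable names are made up, and the second pattern is one possible attribute-tolerant tag matcher that can be used outside a lookbehind. It is not part of the original regex.

```php
<?php
// The lookbehind from the leading-sentence part, reduced to the div case.
$allowedClosingBracket = '/(?<!<div|<\/div)>/';

// A bare <div> is caught: its ">" is preceded by "<div", so nothing matches.
$bare = preg_match($allowedClosingBracket, '<div>');

// A <div> with an attribute slips through: its ">" is preceded by '"a"'.
$withAttribute = preg_match($allowedClosingBracket, '<div class="a">');

// Outside a lookbehind, variable length is fine, so attributes can be allowed for.
$illegalTag = '/<\/?(?:div|span)\b[^>]*>/i';
$caughtBare = preg_match($illegalTag, '<div>');
$caughtWithAttribute = preg_match($illegalTag, '<div class="a">');
```

Here $bare is 0 and $withAttribute is 1, while the variable-length pattern matches both forms.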

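All of this highlights the limitations of a regex-based approach to the problem. In light of this, you might think that switching to a more procedural approach (for example, parsing all sentences into an array regardless of tags, then scanning for "illegal" tags and trimming or rejecting the array accordingly, which would allow more flexible tag-matching regexes) would work better, and you would be right, were it not for the underlying difficulty of matching natural-language constructs such as sentences with regular expressions with any degree of accuracy. I'll leave you to ponder what the "sentence splitting" regex used in this question and answer would do to the following: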
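
A rough sketch of that procedural approach might look like the following. Everything here is an assumption made for illustration: the function name extractAround, its parameters, the naive sentence splitter, and the choice to reject a neighbouring sentence entirely if it contains an "illegal" tag instead of trimming it.

```php
<?php
// Hypothetical helper, not taken from the answer.
function extractAround(string $text, string $word, int $x, int $y, array $illegalTags): ?string
{
    // Naive split: each piece is a run of non-punctuation plus its closing punctuation.
    // Trailing text without closing punctuation is dropped.
    preg_match_all('/[^.!?]*[.!?]+/', $text, $m);
    $sentences = $m[0];

    // Opening or closing illegal tag, with or without attributes.
    $names = implode('|', array_map('preg_quote', $illegalTags));
    $tagPattern = '/<\/?(?:' . $names . ')\b[^>]*>/i';

    foreach ($sentences as $i => $sentence) {
        if (strpos($sentence, $word) === false) {
            continue;
        }
        $start = $i;
        $end = $i;
        // Extend backwards over at most $x sentences free of illegal tags.
        while ($start > 0 && $i - $start < $x
            && !preg_match($tagPattern, $sentences[$start - 1])) {
            $start--;
        }
        // Extend forwards over at most $y sentences free of illegal tags.
        while ($end < count($sentences) - 1 && $end - $i < $y
            && !preg_match($tagPattern, $sentences[$end + 1])) {
            $end++;
        }
        return implode('', array_slice($sentences, $start, $end - $start + 1));
    }
    return null; // keyword not found in any sentence
}
```

Like the regex, this sketch does not inspect the keyword sentence itself for tags, and it inherits every weakness of the punctuation-based sentence split.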

“TJ Hooker was designed by W. Shatner of Starship Enterprise (sic.) fame”</p>

It isn't pretty. Neither is the result.
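
To see the effect, the punctuation-based split can be run on a sentence of that kind. The sample string below is an illustrative English sentence with an initial and a parenthesised "sic.", not necessarily the exact wording of the example above, and the variable names are made up.

```php
<?php
// Illustrative sample with an abbreviated initial and a period inside parentheses.
$sample = 'TJ Hooker was made famous by W. Shatner of Starship Enterprise (sic.) fame.';

// The same "sentence" definition the patterns in this thread rely on.
preg_match_all('/[^.!?]*[.!?]+/', $sample, $m);
$pieces = $m[0];

foreach ($pieces as $n => $piece) {
    echo $n, ': ', $piece, PHP_EOL;
}
```

The single sentence is split into three pieces, breaking after "W." and after "(sic.", which is exactly the kind of result the regex cannot avoid.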
