java - Why does a regex optional non-capturing group not act as optional and mess up the matches?

Question

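I'm using a third-party application that searches html documents with regular expressions. In this case the documents are not properly structured (no head or body), and the matches are returned as attributes in an Excel file. It does not parse them. I'm already aware of the horrors that come from trying to parse html with regex.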

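So I wrote a regex that is supposed to capture every sentence in a paragraph or list item, but after checking the matches I noticed that it sometimes doesn't match all of the sentences, and that it stops matching as soon as a sentence or list item contains an error. It almost always happens with list items, but occasionally with sentences. After realizing that this was due to human error, I added an optional non-capturing group, which completely messed everything up.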

Here is the initial regex I wrote, which works in most cases:

([^<>]*?)[.!?<]|[ <"'/]

Because some sentences contain errors where the author put a space before the punctuation mark, I added an optional non-capturing group:

([^<>]*?)(?:[ ])?[.!?<]|[ <"/l]

Here is an example of the text it is searching:

Buy this because it is soooooooooooooooooooo freaking awesome! If you buy this 
everyone will think you're "cool." You'll get all the babes !<br><br><ul><li>It 
will make you smell better<li>It will make you preform better.</li><li>Will make
you last longer in bed!<li>Will fix any acne problem.</li> <li>It will reduce the
amount you perspire to .01% your normal amount!<br><li>It will make you 
"invincible."</li></ul>

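Because there is nothing that can be used as an anchor (the text starts at the very beginning of the html file), I just have it start capturing right away. As you can see, the text is poorly coded and has grammatical errors, which is why I end the pattern the way I do.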

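The first one captures most of the sentences but misses some... the second returns a bunch of blank matches, which messes up the array created from the captures. It's as if it ignores everything after the non-capturing group.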

I thought about doing this, but it returns every word as a match:

([^<>]*?)[ .!?<]|[ .!?<"/l]

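The only problem is that this cuts some sentences off in the middle, and it would need a third character class, which I think would have a bunch of different options (note the random <br> tags) that would take a while to track down.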

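By the looks of it, it isn't using the optional non-capturing group at all! Why is that? Or am I overlooking something really simple? I have a feeling it's the latter.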
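One way to check whether the optional group is really being skipped is to print group 1 for every match on a tiny input. The following is only a sketch: it assumes the third-party tool behaves like java.util.regex, and the class name `OptionalGroupProbe` and the input `babes !<li>` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalGroupProbe {
    // The second pattern from the question, copied unchanged.
    static final Pattern SECOND =
            Pattern.compile("([^<>]*?)(?:[ ])?[.!?<]|[ <\"/l]");

    // Collects group 1 of every match.
    // A null entry means the match came from the second alternative,
    // where group 1 does not participate at all.
    static List<String> groupOnes(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = SECOND.matcher(input);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical fragment: a space before "!" followed by a tag.
        for (String g : groupOnes("babes !<li>")) {
            System.out.println(g == null ? "(group 1 unset)" : "[" + g + "]");
        }
    }
}
```

On this input, under java.util.regex, the first match has `babes` in group 1 with the trailing space left out, which suggests the optional group does take part. The blank entries seem to come from elsewhere: a bare `<` satisfies the first alternative with an empty group 1, and the `l` of `li` matches the second alternative, where group 1 is unset.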

score 3 · Accepted Answer

I came up with this beast:

(?:^|\s+|>)((?:[^<>.!?\s])(?:[^<>.!?]|\.\d)+(?:\.(?!\d)"?|!|\?)?)

Let me try to explain what I'm doing here.

(?:^|\s+|>)       # only start at the string's beginning, after a run of
                  # whitespace, or after the closing > of a tag
                  # this eliminates all in-tag matches (like "li" and "br")
(                 # opening a capturing group that will contain the actual match
(?:[^<>.!?\s])    # require one leading character that is not <, >, ., !, ?
                  # or whitespace
                  # this eliminates matching a single space between two <li>s
                  # NOTE: there are probably better ways to do this
(?:[^<>.!?]|\.\d)+
                  # one or more possible sentence characters; allow everything
                  # but <, >, ., !, ? EXCEPT FOR . followed by a digit
(?:\.(?!\d)"?|!|\?)?
                  # include possible sentence endings; that is . not followed by
                  # a digit (hence, the negative lookahead), but possibly
                  # followed by ", or !, or ?, or nothing at all
)                 # close the main matching group

You should now be able to access your sentences at capture index 1.
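As a rough illustration of reading capture index 1, here is a minimal Java sketch. It assumes java.util.regex semantics (the third-party tool may differ); the class name `SentenceExtractor` and the method `sentences` are hypothetical, and the input is a shortened piece of the sample text from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceExtractor {
    // The pattern from this answer, escaped for a Java string literal.
    static final Pattern SENTENCE = Pattern.compile(
            "(?:^|\\s+|>)"                  // start of input, whitespace run, or end of a tag
            + "((?:[^<>.!?\\s])"            // one leading non-space sentence character
            + "(?:[^<>.!?]|\\.\\d)+"        // sentence body; "." allowed only before a digit
            + "(?:\\.(?!\\d)\"?|!|\\?)?)"); // optional sentence ending

    // Returns the text captured by group 1 for every match.
    static List<String> sentences(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String sample = "You'll get all the babes !<br><br><ul>"
                + "<li>It will make you smell better"
                + "<li>It will reduce the amount you perspire to .01% your normal amount!<br>"
                + "<li>It will make you \"invincible.\"</li></ul>";
        // Print one captured sentence per line.
        for (String s : sentences(sample)) {
            System.out.println(s);
        }
    }
}
```

Using `m.group(1)` rather than `m.group()` leaves out the leading whitespace or `>` consumed by the non-capturing prefix.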

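I'm sure you will run into cases where my assumptions about what a sentence looks like break down. But I can only work from the example you gave, with all of its oddities included.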
