java - Regex to ignore HTML tags but start at a word boundary, with different ending anchors

Question

Let me start by saying that I need a regex-only solution.

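I'm trying to use a third-party program to pull descriptions out of HTML files. The program is Java-based, but I can't touch its source code in any way! The program I submit the regex to already has another regex script that specifies where on each page to grab the description. It has a handy feature: if you define matches inside that description, it breaks the information down further into an array.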

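I want to match every sentence in the description, whether or not it is a list item. Getting rid of the tags would be ideal, because they cause problems with the \b I use to specify where a match should start.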

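At first I thought I could write a regex that captures everything between a word boundary and a sentence-ending character, something like \b([^.!]+)[.!]. Then I noticed a problem: a description sometimes has an additional section made of list items. To complicate things further, the first part of a list item is sometimes bold or italic. More rarely, for reasons I don't understand, there can be a stray <br> and other tags in there as well.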

Here is a sample description with the common layout, taken from a satirical article:

Children around the world are constantly exposed to the evil “Dolan”, an evil 
duckwho encourages rape, murder, pedophilia, stealing, homosexuality and a range
of other sins.  ”Dolan” is considered a “meme”: an image that makes its way
around the internet via social networks such as Myspace, Friendster, or
Wikipedia.

<li>The duck is based on the character “Donald” created by the company Disney. 
</li><li><b>Dolan, however</b>, is more overtly satanic and enjoys commit crimes
and offending Christianity. </li><li>He is best known for a series of internet 
comics created in the socialist nation of Finland. </li><li><i>Being part of
Scandinavia</i>, the Finnish are clearly followers of Satan and Skrillex. </li>
<li>The comics are written in poor English to distract the viewer from how evil
and offensive they truly are.</li>

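I have tried a few different things, but I'm still a regex newbie and I get all sorts of results that don't work properly. This one broke everything by starting at any letter inside a tag: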

(?:<li>|<b>|<i>)?\b([^.!<]+)[.!< ][<lbi/ ]

The code above gives an array like this (the order is random, or at least organized in a way I don't understand):

i>
Being Part of Scandinavia
i>
b>
Dolan, however
b>

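A nearly identical one leaves some HTML markup in the result, which I think is because li> satisfies the word-boundary requirement. Note: there is a trailing space at the end of the code below.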

\b([^.!<]+)[.!]

This gives an array like this:

li>The duck is based on the character “Donald”...
li>li>b>Dolan, however/b>, is more overtly satanic...
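
A minimal Java sketch of why the tag remnants appear, assuming the host program uses standard java.util.regex semantics (the class and method names are made up for illustration). \b only needs a word character on one side, so it already matches between < and l, and [^.!<] does not exclude >, so the match runs straight through the rest of the tag:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // Index of the first position where \b matches, or -1 if there is none.
    static int firstBoundary(String s) {
        Matcher m = Pattern.compile("\\b").matcher(s);
        return m.find() ? m.start() : -1;
    }

    // First match of the question's second pattern (trailing space omitted here).
    static String firstMatch(String s) {
        Matcher m = Pattern.compile("\\b([^.!<]+)[.!]").matcher(s);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<li>The duck.";
        System.out.println(firstBoundary(html)); // 1, i.e. between '<' and 'l'
        System.out.println(firstMatch(html));    // li>The duck.
    }
}
```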

As I said before, I'm a regex noob, and I'm pretty sure I'm using lookaheads the wrong way.

Please help me sort this out! I don't know what to try next.

PS: I didn't write the article; it was copied from another website. No offense intended.

score 1 · Accepted Answer

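Forget about \b; it will only get in your way. You don't need lookarounds either. The following regex correctly matches all the sentences in your sample text. As with @icrf's regex, any tags inside a sentence are left in place. Getting rid of those requires a second step; I don't see any way around that.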

[^\s<>.!?][^<>.!?]*(?:<[^<>]+>[^<>.!?]*)*[.!?]

Breaking it down:

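[^\s<>.!?] starts the match at the next character that is not whitespace, an angle bracket, or sentence punctuation.
[^<>.!?]* goes on matching the wanted characters, now including whitespace.
<[^<>]+>: if it finds a left angle bracket, this part tries to match an HTML tag. Then it goes back to matching non-special characters with [^<>.!?]*. It keeps alternating like this until there are no more tags or non-special characters to consume.
Finally, [.!?] matches the sentence-ending punctuation.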
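
As a rough illustration of the two steps described above, here is a small Java sketch. It assumes you can post-process the matches yourself, which the asker's host program may not allow; the class name, method name, and sample string are hypothetical. The first pattern is the regex from this answer, and the second simply removes whatever tags remain inside each match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceExtractor {
    // The regex from this answer: one sentence, possibly with tags inside it.
    private static final Pattern SENTENCE =
        Pattern.compile("[^\\s<>.!?][^<>.!?]*(?:<[^<>]+>[^<>.!?]*)*[.!?]");

    // Second step (an assumption, not part of the regex above): strip leftover tags.
    private static final Pattern TAG = Pattern.compile("<[^<>]+>");

    static List<String> sentences(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(html);
        while (m.find()) {
            out.add(TAG.matcher(m.group()).replaceAll("").trim());
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<li>The duck is evil. </li><li><b>Dolan, however</b>, is worse. </li>";
        // Prints each sentence on its own line, with the <b>...</b> pair removed.
        sentences(html).forEach(System.out::println);
    }
}
```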

score 0 · Accepted Answer

How about this:

(?:^|(?<=[.!]))(?:</?[a-zA-Z][^>]*>)*([^<][^.!]+)(?:[.!]|$)

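The idea is to match everything from one sentence delimiter to the next. I use a positive lookbehind (the (?<=[.!]) part) to match the first delimiter, so the regex doesn't actually consume that character; it only checks that it is present in the right place.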

Running this regex on your sample article, I get the following matches:

Children around the world are constantly exposed to the evil...
  ”Dolan” is considered a “meme”: an image that makes its way...
<li>The duck is based on the character “Donald” created by...
</li><li><b>Dolan, however</b>, is more overtly satanic and...
 </li><li>He is best known for a series of internet comics created...
 </li><li><i>Being part of Scandinavia</i>, the Finnish are clearly...
 </li><li>The comics are written in poor English to distract...

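The resulting matches still need some processing (namely trimming whitespace and stripping tags), but at least the regex seems to match the sentences correctly.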
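
The post-processing mentioned above could look roughly like this in Java. This is only a sketch under the assumption that capture group 1 is what gets used; the class and method names are invented for the example. Note that group 1 does not include the terminating . or !, and that leftover fragments consisting only of tags and whitespace are dropped:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DelimiterMatcher {
    // The regex from this answer; group 1 holds the text between delimiters.
    private static final Pattern SENTENCE = Pattern.compile(
        "(?:^|(?<=[.!]))(?:</?[a-zA-Z][^>]*>)*([^<][^.!]+)(?:[.!]|$)");

    static List<String> sentences(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(html);
        while (m.find()) {
            // Strip tags that ended up inside the capture, then trim whitespace.
            String cleaned = m.group(1).replaceAll("<[^<>]+>", "").trim();
            if (!cleaned.isEmpty()) {
                out.add(cleaned);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(sentences("One. <li>Two!</li>")); // [One, Two]
    }
}
```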

score 0 · Accepted Answer

\b(?<![</])(?!>)[^.?!]+[.!?]

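This leaves HTML tags outside the sentences unmatched, but tags inside a sentence remain and have to be removed afterwards. There is no way to get a sentence without them, because it would no longer be one contiguous match; that is a limitation of solving this problem with a regex alone.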

The negative lookbehind (?<![</]) and the negative lookahead (?!>) at the start keep the match from starting inside a tag.

The duck is based on the character "Donald" created by the company Disney.
Dolan, however</b>, is more overtly satanic and enjoys commit crimes and offending Christianity.
He is best known for a series of internet comics created in the socialist nation of Finland.
Being part of Scandinavia</i>, the Finnish are clearly followers of Satan and Skrillex.
The comics are written in poor English to distract the viewer from how evil and offensive they truly are.

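The HTML left inside is not valid HTML, because an opening or closing tag may lie outside the sentence itself (see the closing bold tag with no opening tag in the second sentence).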
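
To make the dangling-tag effect concrete, here is a short Java sketch that runs this answer's regex over a shortened, made-up list item (the class and method names are hypothetical). The match starts after <b>, so the opening tag is skipped while </b> stays inside the sentence:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryLookaround {
    // The regex from this answer: a word boundary that is not directly
    // after '<' or '/', and not directly before '>'.
    private static final Pattern SENTENCE =
        Pattern.compile("\\b(?<![</])(?!>)[^.?!]+[.!?]");

    static List<String> rawMatches(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = SENTENCE.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [Dolan, however</b>, is worse.]
        System.out.println(rawMatches("<li><b>Dolan, however</b>, is worse. </li>"));
    }
}
```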
