c# - Improving a regular expression to split large text into sentences

Question

Possible duplicate:
What is a regular expression for parsing out individual sentences?

I want to split a large text into sentences. The regular expression I got from an answer here is

string[] sentences = Regex.Split(mytext, @"(?<=[\.!\?])\s+");

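So I would like a pattern that splits only where a . ? or ! is followed by a space and a capital letter, and does not split otherwise.
A capital letter indicates the start of a sentence.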

text = " Sentence one . Sentence e.g. two ? Sentence three."
sentence[1] = Sentence one 
sentence[2] = Sentence e.g. two

For problematic cases such as abbreviations, I plan to replace them first:

mytext = mytext.Replace("e.g.", "eg");

How can I implement this in a regular expression?

score 6 · Accepted Answer

\p{Lu} matches a Unicode uppercase letter (including accented ones), so

string[] sentences = Regex.Split(mytext, @"(?<=[.!?])\s+(?=\p{Lu})");

should do what you want.

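(Note that . and ? do not need to be escaped inside a character class, so I removed the backslashes as well, but please check that this still works for those characters.)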
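
To see how such a split behaves on the sample text from the question, here is a minimal sketch. It assumes the `\p{Lu}` (Unicode uppercase letter) category for the lookahead; the class name `BasicSentenceSplitter` and its `Split` method are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BasicSentenceSplitter
{
    // Split on whitespace that follows . ! or ? and precedes an uppercase letter.
    // Both conditions are lookarounds, so no characters other than the whitespace are consumed.
    private static readonly Regex Boundary =
        new Regex(@"(?<=[.!?])\s+(?=\p{Lu})");

    public static string[] Split(string text)
    {
        // Trim first so leading/trailing whitespace does not end up in the first or last piece.
        return Boundary.Split(text.Trim());
    }
}
```

Calling `BasicSentenceSplitter.Split(" Sentence one . Sentence e.g. two ? Sentence three.")` should give three pieces: `Sentence one .`, `Sentence e.g. two ?` and `Sentence three.`. The `e.g.` is left alone here only because the next word starts with a lowercase letter.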

However, note that this will still split on e.g. Mr. Jones...
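
One possible way to reduce such false splits, offered only as a sketch: add a negative lookbehind that rejects a boundary when the text just before it is a known abbreviation. This relies on .NET allowing variable-length lookbehind. The abbreviation list (`Mr`, `Mrs`, `Dr`, `e.g`, `i.e`) and the name `AbbreviationAwareSplitter` are assumptions for illustration, not part of the answer above, and any real list would need to be extended.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AbbreviationAwareSplitter
{
    // Same boundary as before, plus a negative lookbehind:
    // do not split if the text ending at this point is one of the listed abbreviations.
    // The abbreviation list is a hypothetical, incomplete example.
    private static readonly Regex Boundary = new Regex(
        @"(?<=[.!?])(?<!\b(?:Mr|Mrs|Dr|e\.g|i\.e)\.)\s+(?=\p{Lu})");

    public static string[] Split(string text)
    {
        return Boundary.Split(text.Trim());
    }
}
```

With this pattern, `"Hello there. Mr. Jones arrived! Was he late? No."` should split into `Hello there.`, `Mr. Jones arrived!`, `Was he late?` and `No.`. A sentence that genuinely ends in one of the listed abbreviations would no longer be split, which is the trade-off of this approach.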
