ruby - Split text into sentences, but skip quoted content

Question

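I want to split some text into sentences using a regular expression (in Ruby). It does not need to be exact, so cases such as "Washington D.C." can be ignored.

However, I have one requirement: if a sentence is quoted (with single or double quotes), the punctuation inside the quotes should not split it.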

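Say I have the following text:

Sentence one. "Wow." said Alice. Sentence three.

It should be split into three sentences:

Sentence one.
"Wow." said Alice.
Sentence three.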

Currently I have content.scan(/[^\.!\?\n]*[\.!\?\n]/), but it breaks on quoted text.
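To make the quote problem concrete, here is a small sketch that runs the pattern above on the sample text. The variable names `text` and `naive` are only chosen for this illustration.

```ruby
# Sample text from the question, with ASCII quotes and periods.
text = 'Sentence one. "Wow." said Alice. Sentence three.'

# The quote-unaware pattern: a run of non-terminators, then one terminator.
naive = text.scan(/[^\.!\?\n]*[\.!\?\n]/)

naive.each { |piece| p piece }
# "Sentence one."
# " \"Wow."
# "\" said Alice."
# " Sentence three."
```

The period inside the quotation ends a match, so the quoted sentence is cut in two and four pieces come back instead of three.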

Update:

The current answer can run into performance problems. Try the following:

'Alice stood besides the table. She looked towards the rabbit, "Wait! Stop!", said Alice'.scan(regexp)

It would be great if someone could figure out how to avoid that. Thanks!

score 8 · Accepted Answer

How about this:

result = subject.scan(
    /(?:      # Either match...
     "[^"]*"  # a quoted sentence
    |         # or
     [^".!?]* # anything except quotes or punctuation.
    )++       # Repeat as needed; possessive, so no backtracking
    [.!?\s]*  # Then match any trailing punctuation and whitespace.
    /x)
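As a usage sketch, the same pattern can be wrapped in a small helper. The names `SENTENCE` and `split_sentences` are made up for this example and are not part of the answer. The trimming and the removal of empty strings are additions, made on the assumption that the caller wants clean sentences: the pattern can match the empty string, so `scan` returns one empty match at the end of the input.

```ruby
# The answer's pattern on one line. SENTENCE is a name chosen for this sketch.
SENTENCE = /(?:"[^"]*"|[^".!?]*)++[.!?\s]*/

# Hypothetical helper: scan, trim surrounding whitespace, and drop the
# empty match that the pattern produces at the end of the input.
def split_sentences(text)
  text.scan(SENTENCE).map(&:strip).reject(&:empty?)
end

p split_sentences('Sentence one. "Wow." said Alice. Sentence three.')
# ["Sentence one.", "\"Wow.\" said Alice.", "Sentence three."]

p split_sentences('Alice stood besides the table. She looked towards the rabbit, "Wait! Stop!", said Alice')
# ["Alice stood besides the table.", "She looked towards the rabbit, \"Wait! Stop!\", said Alice"]
```

Note that the pattern only treats double quotes as quotation marks. Single quotes are not handled, and treating them the same way would be awkward because apostrophes use the same character.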
