ruby - Split body text into sentences but keep the punctuation?

Question

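I am trying to produce a wiki-like, human-readable diff between 2 bodies of HTML-laden text. I am using diff-lcs, and the first step is to split the string (an array of characters) into an array of sentences while keeping their punctuation.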

"I am a lion. Hear me roar! Where is my cub? Never mind, found him.".magic_split(/[.?!]/)
# => "I am a lion." "Hear me roar!" "Where is my cub?" "Never mind, found him."

This ought to do the trick:

"I am a lion. Hear me roar! Where is my cub? Never mind, found him.".gsub(/[.?!]/, '\1|').split('|')

Except that gsub does not seem to re-insert the matched character (., ? or !). Instead it returns this:

"I am a lion| Hear me roar| Where is my cub| Never mind, found him|"

What is the simplest way to do a non-destructive split, that is, one that keeps the characters it splits on?

score 13 · Accepted Answer

scan should do the trick (throw a strip in there to get rid of the whitespace left at the start of each match).

s = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."
s.scan(/[^\.!?]+[\.!?]/).map(&:strip) # => ["I am a lion.", "Hear me roar!", "Where is my cub?", "Never mind, found him."]
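One thing to keep in mind with the scan pattern above is that it only returns text that ends in ., ! or ?, so a final fragment without closing punctuation is dropped. A minimal sketch of an alternative, assuming sentences are separated by whitespace: split on the whitespace that follows a punctuation mark, using a lookbehind so the mark itself is kept. The helper name split_sentences is hypothetical and not from the answer.

```ruby
# Hypothetical helper (not part of the original answer).
# Splits on whitespace that is preceded by ., ? or !.
# The lookbehind (?<=...) matches a position, not characters,
# so the punctuation stays attached to its sentence.
def split_sentences(text)
  text.split(/(?<=[.?!])\s+/)
end

s = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."
p split_sentences(s)
# => ["I am a lion.", "Hear me roar!", "Where is my cub?", "Never mind, found him."]
```

Because the separator already consumes the whitespace, no strip is needed here, and a trailing fragment with no punctuation is kept as the last element.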

score 3 · Answer

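I think it should be \0. In a gsub replacement string, \0 stands for the whole match, while \1 refers to the first capture group, and this regex has none, so \1 expands to an empty string.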

>> string = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."
>> string.gsub(/[.?!]/, '\0|') 
   # "I am a lion.| Hear me roar!| Where is my cub?| Never mind, found him.|"
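To get from that marked-up string back to an array of sentences, the approach from the question can be finished with split and strip. The sketch below uses the equivalent capture-group form, where wrapping the character class in parentheses makes \1 meaningful. The helper name sentences_via_gsub is hypothetical, and the approach assumes the text never contains a literal | character.

```ruby
# Hypothetical helper (not part of the original answer).
# Assumes the input contains no literal "|", since that is used as the marker.
def sentences_via_gsub(text)
  text
    .gsub(/([.?!])/, '\1|') # re-insert the captured punctuation, then a marker
    .split('|')             # split on the marker; trailing empty strings are dropped
    .map(&:strip)           # remove the space left at the start of each piece
end

string = "I am a lion. Hear me roar! Where is my cub? Never mind, found him."
p sentences_via_gsub(string)
# => ["I am a lion.", "Hear me roar!", "Where is my cub?", "Never mind, found him."]
```

Using '\0|' with the original /[.?!]/ pattern gives the same result; the capture group is only there to show why '\1|' produced nothing in the question.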
