I'd like to convert an array of lexemes produced by a PTB-style tokenizer:
["The", "house", "is", "n't", "on", "fire", "."]
into a sentence:
"The house isn't on fire."
What is a sensible way to accomplish this?
If we take @sawa's advice on the apostrophe and make your array:
["The", "house", "isn't", "on", "fire", "."]
then you can get what you want (with punctuation support!) like this:
def sentence(array)
  str = +""  # unfrozen, mutable buffer
  array.each_with_index do |w, i|
    case w
    when '.', '!', '?'  # Sentence enders: add a space too if more words follow.
      str << w
      str << ' ' unless i == array.length - 1
    when ',', ';'       # Inline separators
      str << w
      str << ' '
    when '--'           # Dash
      str << ' -- '
    else                # It's a word: separate it from the previous word.
      str << ' ' unless str.empty? || str[-1] == ' '
      str << w
    end
  end
  str
end
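If you would rather not merge the array by hand, here is a minimal sketch that glues the contraction tokens back onto the preceding word first. The names `CLITICS`, `merge_clitics`, and `detokenize` are made up for illustration, and the clitic list is an assumption that covers only common English contractions, not the full PTB convention.

```ruby
# Assumed list of PTB-style clitic tokens: "n't", "'s", "'re", "'ve", "'ll", "'d", "'m".
CLITICS = /\A(?:n't|'(?:s|re|ve|ll|d|m))\z/i

# Attach each clitic token to the token before it.
def merge_clitics(tokens)
  tokens.each_with_object([]) do |tok, out|
    if tok.match?(CLITICS) && !out.empty?
      out[-1] += tok   # e.g. "is" + "n't" => "isn't"
    else
      out << tok
    end
  end
end

# Join with spaces, then drop the space before simple punctuation marks.
def detokenize(tokens)
  merge_clitics(tokens).join(' ').gsub(/ ([.,;!?])/, '\1')
end

puts detokenize(["The", "house", "is", "n't", "on", "fire", "."])
```

The output of `merge_clitics` can also be passed to `sentence` above if you want its handling of `--`.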