ruby - How do you parse a paragraph of text into sentences? (preferably in Ruby)

Question

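How do you take a paragraph or a large amount of text and break it into sentences (preferably using Ruby), taking into account cases such as Mr., Dr. and U.S.A.? (Assume you just put the sentences into an array of arrays.)

UPDATE: One possible solution I thought of uses a part-of-speech tagger (POST) and a classifier to determine the end of a sentence:

Given this input: Mr. Jones felt the warm sun on his face as he stepped out onto the balcony of his summer home in Italy. He was happy to be alive.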

CLASSIFIER Mr./PERSON Jones/PERSON felt/O the/O warm/O sun/O on/O his/O face/O as/O he/O stepped/O out/O onto/O the/O balcony/O of/O his/O summer/O home/O in/O Italy/LOCATION ./O He/O was/O happy/O to/O be/O alive/O ./O

POST Mr./NNP Jones/NNP felt/VBD the/DT warm/JJ sun/NN on/IN his/PRP$ face/NN as/IN he/PRP stepped/VBD out/RP onto/IN the/DT balcony/NN of/IN his/PRP$ summer/NN home/NN in/IN Italy./NNP He/PRP was/VBD happy/JJ to/TO be/VB alive./IN

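Since Italy is a location, can we assume that the period after it is a valid end of sentence? And since ending on "Mr." would leave the sentence with no other parts of speech, can we assume that this period is not a valid end-of-sentence period? Is this the best answer to my question?

Thoughts?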
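
As a rough illustration of the idea above, here is a minimal Ruby sketch that groups tagged tokens back into sentences. It assumes the tagger already emits a sentence-final period as its own token (as in the CLASSIFIER output, `./O`) while abbreviation periods stay attached to their word (`Mr./PERSON`). The method name `sentences_from_tagged` is made up for this example, and the sketch handles periods only.

```ruby
# Hypothetical helper: rebuild sentences from "token/TAG" pairs.
# Assumption: a standalone "." token marks a sentence end, while a period
# attached to a word (e.g. "Mr.") is part of that word.
def sentences_from_tagged(tagged)
  sentences = []
  words = []
  tagged.split.each do |pair|
    # rpartition splits on the LAST slash, so "./O" yields "." and "O".
    token, _slash, _tag = pair.rpartition('/')
    if token == '.'
      sentences << "#{words.join(' ')}." unless words.empty?
      words = []
    else
      words << token
    end
  end
  # Keep any trailing words that were not closed by a period token.
  sentences << words.join(' ') unless words.empty?
  sentences
end

tagged = 'Mr./PERSON Jones/PERSON was/O in/O Italy/LOCATION ./O ' \
         'He/O was/O happy/O ./O'
puts sentences_from_tagged(tagged)
# Mr. Jones was in Italy.
# He was happy.
```

Note that this pushes the hard part into the tokenizer: it only works if something upstream has already decided which periods stand alone.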

score 13 · Accepted Answer

Try looking at the Ruby wrapper around the Stanford Parser. It has a getSentencesFromString() function.

score 8 · Accepted Answer

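Just to make it clear: there is no simple solution to this. It is a topic of NLP research, as a quick Google search shows.

However, there are some open source NLP projects that support sentence detection. I found the following Java-based toolset:

OpenNLP

Additional comment: The problem of deciding where sentences begin and end is also called sentence boundary disambiguation (SBD) in natural language processing.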

score 6 · Accepted Answer

It looks like this Ruby gem might do the trick.

https://github.com/zencephalon/Tactful_Tokenizer

Answered 2010-05-06T16:03:03.373

score 5 · Accepted Answer

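Take a look at the Python sentence splitter in NLTK (Natural Language Tool Kit):

Punkt sentence tokenizer

It is based on the following paper:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32: 485-525.

The approach in the paper is quite interesting. They reduce the problem of sentence splitting to the problem of determining how strongly a word is associated with the punctuation that follows it. The overloading of periods after abbreviations is responsible for most of the ambiguous periods, so if you can identify the abbreviations, you can identify the sentence boundaries with high probability.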
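
To make the "association with the following period" idea concrete, here is a deliberately crude Ruby sketch. It is not the Punkt algorithm: the paper uses more refined statistics, whereas this just computes the fraction of a word's occurrences that end in a period and keeps short, frequent words above a threshold. The method name and all three thresholds are assumptions chosen for the example.

```ruby
# Simplified stand-in for abbreviation detection (NOT the Punkt algorithm):
# a word that is short, frequent, and almost always followed by a period
# is treated as a likely abbreviation.
def likely_abbreviations(text, min_count: 2, min_ratio: 0.9, max_length: 4)
  total = Hash.new(0)
  with_period = Hash.new(0)

  text.split.each do |raw|
    # Normalize: lowercase and strip trailing punctuation.
    word = raw.downcase.sub(/[.!?,;:]+\z/, '')
    next if word.empty?

    total[word] += 1
    with_period[word] += 1 if raw.end_with?('.')
  end

  total.keys.select do |word|
    total[word] >= min_count &&
      word.length <= max_length &&
      with_period[word].to_f / total[word] >= min_ratio
  end
end

corpus = 'Dr. Smith met Dr. Jones at noon. Mr. Jones saw Mr. Smith ' \
         'at noon today. Dr. Smith left.'
p likely_abbreviations(corpus)
# ["dr", "mr"]
```

On real text the thresholds would need tuning against a much larger corpus; "noon" is excluded here only because it also appears without a trailing period.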

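I have tested this tool informally, and it seems to give good results for a variety of (human) languages.

Porting it to Ruby would be non-trivial, but it might give you some ideas.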

score 4 · Accepted Answer

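This is a hard problem if you really care about getting it right. NLP parser packages probably provide this functionality. If you want something faster, you will end up duplicating some of that functionality with a trained probabilistic function over a window of tokens (you would probably want to count a line break as a token, since the final period may be missing at the end of a paragraph).

Edit: If you can use Java, I recommend the Stanford parser. I have no recommendation for other languages, but I would be very interested to hear what else is available as open source.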

score 2 · Accepted Answer

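Unfortunately I'm not a Ruby guy, but maybe an example in Perl will get you headed in the right direction. It uses a lookbehind for the ending punctuation, then negative lookbehinds for a few special cases, followed by any amount of whitespace and a lookahead for a capital letter. I'm sure this isn't perfect, but I hope it points you in the right direction. I'm not sure how you would know whether U.S.A. is actually at the end of a sentence...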

#!/usr/bin/perl
use strict;
use warnings;

my $string = "Mr. Thompson is from the U.S.A. and is 75 years old. Dr. Bob is a dentist. This is a string that contains several sentences. For example this is one. Followed by another. Can it deal with a question?  It sure can!";

# Split after ., ! or ? unless the period belongs to Mr., Dr. or U.S.A.;
# the whitespace before the next capitalized word is consumed by the split.
my @sentences = split(/(?<=[.!?])(?<!Mr\.|Dr\.)(?<!U\.S\.A\.)\s+(?=[A-Z])/, $string);

for (@sentences) {
    print $_."\n";
}
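
Since the question asks for Ruby, the same pattern can be sketched there too. Ruby's regex engine accepts these lookbehinds as long as each top-level alternative has a fixed width. The constant `BOUNDARY` and the method `split_sentences` are names made up for this example, and the abbreviation list is just the three cases from the Perl version.

```ruby
# Boundary: sentence-ending punctuation that is not part of Mr./Dr./U.S.A.,
# followed by whitespace and then a capital letter.
BOUNDARY = /(?<=[.!?])(?<!Mr\.|Dr\.)(?<!U\.S\.A\.)\s+(?=[A-Z])/

def split_sentences(text)
  text.split(BOUNDARY)
end

text = 'Mr. Thompson is from the U.S.A. and is 75 years old. ' \
       'Dr. Bob is a dentist. Can it deal with a question?  It sure can!'

split_sentences(text).each { |sentence| puts sentence }
# Mr. Thompson is from the U.S.A. and is 75 years old.
# Dr. Bob is a dentist.
# Can it deal with a question?
# It sure can!
```

As in the Perl version, a sentence that genuinely ends in "U.S.A." and is followed by a capitalized word will not be split.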

score 2 · Accepted Answer

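I agree with the accepted answer: using Stanford Core NLP is a no-brainer.

However, as of 2016 there are some incompatibilities when interfacing the Stanford Parser with later versions of Stanford Core NLP (I had issues with Stanford Core NLP v3.5).

Here is what I did to parse text into sentences using Ruby interfacing with Stanford Core NLP:

Install the Stanford CoreNLP gem: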

gem install stanford-core-nlp

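Then follow the instructions in the README section "Using the latest version of the Stanford CoreNLP":

Using the latest version of the Stanford CoreNLP (version 3.5.0 as of 31 October 2014) requires some additional manual steps:

1. Download Stanford CoreNLP version 3.5.0 from http://nlp.stanford.edu/.

2. Place the contents of the extracted archive inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory location configured by setting StanfordCoreNLP.jar_path.

3. Download the full Stanford Tagger version 3.5.0 from http://nlp.stanford.edu/.

4. Create a directory named taggers inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/) or inside the directory configured by setting StanfordCoreNLP.jar_path.

5. Place the contents of the extracted archive inside the taggers directory.

6. Download the bridge.jar file from https://github.com/louismullie/stanford-core-nlp.

7. Place the downloaded bridge.jar file inside the /bin/ folder of the stanford-core-nlp gem (e.g. [...]/gems/stanford-core-nlp-0.x/bin/taggers/) or inside the directory configured by setting StanfordCoreNLP.jar_path.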

Then the Ruby code to split text into sentences:

require "stanford-core-nlp"

#I downloaded the StanfordCoreNLP to a custom path:
StanfordCoreNLP.jar_path = "/home/josh/stanford-corenlp-full-2014-10-31/"
  
StanfordCoreNLP.use :english
StanfordCoreNLP.model_files = {}
StanfordCoreNLP.default_jars = [
  'joda-time.jar',
  'xom.jar',
  'stanford-corenlp-3.5.0.jar',
  'stanford-corenlp-3.5.0-models.jar',
  'jollyday.jar',
  'bridge.jar'
]

pipeline =  StanfordCoreNLP.load(:tokenize, :ssplit)

text = 'Mr. Josh Weir is writing some code. ' + 
  'I am Josh Weir Sr. my son may be Josh Weir Jr. etc. etc.'
text = StanfordCoreNLP::Annotation.new(text)
pipeline.annotate(text)
text.get(:sentences).each{|s| puts "sentence: " + s.to_s}
  
#output:
#sentence: Mr. Josh Weir is writing some code.
#sentence: I am Josh Weir Sr. my son may be Josh Weir Jr. etc. etc.

score 1 · Accepted Answer

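Maybe try splitting on a period followed by a space followed by an uppercase letter? I'm not sure how to find uppercase letters, but that is the pattern I would start with.

Edit: Finding uppercase letters with Ruby.

Another edit:

Check for sentence-ending punctuation that follows a word that does not start with an uppercase letter.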
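
A minimal Ruby sketch of that last heuristic follows. It treats punctuation as a sentence end only when the next word is capitalized and the word carrying the punctuation starts with a lowercase letter, so "Mr." is skipped without any abbreviation list. The method name is hypothetical.

```ruby
# Hypothetical helper: split only where the word before the punctuation
# does NOT start with an uppercase letter and the next word does.
def split_after_lowercase_words(text)
  sentences = []
  start = 0

  text.scan(/(\S+)[.!?]\s+(?=[A-Z])/) do
    match = Regexp.last_match
    # Capitalized word before the punctuation (Mr., Dr., ...): not a boundary.
    next if match[1].match?(/\A[A-Z]/)

    boundary = match.end(0)
    sentences << text[start...boundary].strip
    start = boundary
  end

  rest = text[start..-1].strip
  sentences << rest unless rest.empty?
  sentences
end

puts split_after_lowercase_words(
  'Mr. Brown went to the U.S.A. last year. He liked it.'
)
# Mr. Brown went to the U.S.A. last year.
# He liked it.
```

The obvious cost is that a sentence ending in a proper noun ("I live in Italy. He does too.") is never split.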

score 1 · Accepted Answer

If you are considering Java (and Ruby is proving difficult ;)), Dr. Manning's answer is the most appropriate. Here it is:

There is a sentence splitter: edu.stanford.nlp.process.DocumentPreprocessor. Try the command: java edu.stanford.nlp.process.DocumentPreprocessor /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt. (This is done via a (good but heuristic) FSM, so it's fast; you are not running the probabilistic parser.)

One small suggestion, though: change the command from

java edu.stanford.nlp.process.DocumentPreprocessor /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt

to

java edu.stanford.nlp.process.DocumentPreprocessor -file /u/nlp/data/lexparser/textDocument.txt > oneTokenizedSentencePerLine.txt

and it works fine, because you need to specify the type of file presented as input: -file for a text file, -html for HTML, and so on.

score 1 · Accepted Answer

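I haven't tried it, but if English is the only language you care about, I suggest taking a look at Lingua::EN::Readability.

Lingua::EN::Readability is a Ruby module which calculates statistics on English text. It can supply counts of words, sentences and syllables. It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level. The package includes the module Lingua::EN::Sentence, which breaks English text into sentences heeding abbreviations, and Lingua::EN::Syllable, which can guess the number of syllables in a written English word. If a pronouncing dictionary is available, it can look up the number of syllables in the dictionary for greater accuracy.

The bit you want is in sentence.rb, as follows: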

module Lingua
module EN
# The module Lingua::EN::Sentence takes English text, and attempts to split it
# up into sentences, respecting abbreviations.

module Sentence
  EOS = "\001" # temporary end of sentence marker

  Titles   = [ 'jr', 'mr', 'mrs', 'ms', 'dr', 'prof', 'sr', 'sen', 'rep', 
         'rev', 'gov', 'atty', 'supt', 'det', 'rev', 'col','gen', 'lt', 
         'cmdr', 'adm', 'capt', 'sgt', 'cpl', 'maj' ]

  Entities = [ 'dept', 'univ', 'uni', 'assn', 'bros', 'inc', 'ltd', 'co', 
         'corp', 'plc' ]

  Months   = [ 'jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 
         'aug', 'sep', 'oct', 'nov', 'dec', 'sept' ]

  Days     = [ 'mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun' ]

  Misc     = [ 'vs', 'etc', 'no', 'esp', 'cf' ]

  Streets  = [ 'ave', 'bld', 'blvd', 'cl', 'ct', 'cres', 'dr', 'rd', 'st' ]

  @@abbreviations = Titles + Entities + Months + Days + Streets + Misc

  # Split the passed text into individual sentences, trim these and return
  # as an array. A sentence is marked by one of the punctuation marks ".", "?"
  # or "!" followed by whitespace. Sequences of full stops (such as an
  # ellipsis marker "...") and stops after a known abbreviation are ignored.
  def Sentence.sentences(text)

    text = text.dup

    # initial split after punctuation - have to preserve trailing whitespace
    # for the ellipsis correction next
    # would be nicer to use look-behind and look-ahead assertions to skip
    # ellipsis marks, but Ruby doesn't support look-behind
    text.gsub!( /([\.?!](?:\"|\'|\)|\]|\})?)(\s+)/ ) { $1 << EOS << $2 }

    # correct ellipsis marks and rows of stops
    text.gsub!( /(\.\.\.*)#{EOS}/ ) { $1 }

    # correct abbreviations
    # TODO - precompile this regex?
    text.gsub!( /(#{@@abbreviations.join("|")})\.#{EOS}/i ) { $1 << '.' }

    # split on EOS marker, strip gets rid of trailing whitespace
    text.split(EOS).map { | sentence | sentence.strip }
  end

  # add a list of abbreviations to the list that's used to detect false
  # sentence ends. Return the current list of abbreviations in use.
  def Sentence.abbreviation(*abbreviations)
    @@abbreviations += abbreviations
    @@abbreviations
  end
end
end
end

score 0 · Accepted Answer

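I'm not a Ruby guy, but a RegEx that splits on

 ^(Mr|Mrs|Ms|Mme|Sta|Sr|Sra|Dr|U\.S\.A)[\.\!\?\"] [A-Z]

would be my best bet, once you have the paragraph (split on \r\n). This assumes that your sentences are properly cased.

Obviously this is a fairly ugly RegEx. What about forcing two spaces between sentences?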
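
If the text really does follow a two-spaces-between-sentences convention, the split becomes trivial. A minimal Ruby sketch, with a hypothetical method name, assuming single spaces never separate sentences:

```ruby
# Assumes the input reliably uses two or more spaces between sentences,
# so a single space after "Mr." is never mistaken for a boundary.
def split_on_double_space(text)
  text.split(/(?<=[.!?]) {2,}/)
end

puts split_on_double_space('Mr. Brown arrived.  He sat down.  Done!')
# Mr. Brown arrived.
# He sat down.
# Done!
```

This only helps when you control how the text is produced; most text found in the wild uses single spaces.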

score 0 · Accepted Answer

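Breaking on a period followed by a space and a capital letter won't work for titles such as "Mr. Brown".

Periods make things difficult, but exclamation points and question marks are an easier case to handle. Even so, there are cases where this breaks down, for example the company name Yahoo!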

score 0 · Accepted Answer

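Well, obviously paragraph.split('.') won't cut it.

#split will take a regex as an argument, so you might try using a zero-width assertion to check for a word starting with a capital letter. Of course this will split on proper nouns, so you may have to resort to a regex like /(Mr\.|Mrs\.|U\.S\.A ...), which would be horrendously ugly unless you built the regex programmatically.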
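
Building the regex programmatically might look like the sketch below. The abbreviation list, the constant `KNOWN_ABBREVIATIONS` and the method names are assumptions for illustration. It relies on Ruby allowing top-level alternatives of different fixed widths inside a lookbehind.

```ruby
KNOWN_ABBREVIATIONS = %w[Mr Mrs Ms Dr U.S.A].freeze

# Build a boundary regex from a list of abbreviations instead of
# writing the alternation by hand.
def boundary_regex(abbreviations)
  protected_endings = abbreviations
                      .map { |abbr| Regexp.escape("#{abbr}.") }
                      .join('|')
  /(?<=[.!?])(?<!#{protected_endings})\s+(?=[A-Z])/
end

def split_with_abbreviations(text, abbreviations = KNOWN_ABBREVIATIONS)
  text.split(boundary_regex(abbreviations))
end

puts split_with_abbreviations('Mrs. Smith met Dr. Jones. They talked.')
# Mrs. Smith met Dr. Jones.
# They talked.
```

Extending the list is then a data change, not a regex edit, although every abbreviation still has to be known in advance.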

score 0 · Accepted Answer

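I don't think this is always solvable, but you could split on ". " (a period followed by a space) and verify that the word before the period is not in a list of words such as Mr, Dr, etc.

Of course, your list may be missing some words, and in that case you will get bad results.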
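
Here is a minimal Ruby sketch of that approach, walking the text word by word. The list `TITLE_WORDS` and the method name are hypothetical, and only periods are handled. The second call shows the failure mode described above when a word is missing from the list.

```ruby
TITLE_WORDS = %w[Mr Mrs Ms Dr].freeze

# End a sentence at every word that ends in a period, unless the word
# (without its period) is in the known-abbreviation list.
def split_with_word_list(text, abbreviations = TITLE_WORDS)
  sentences = []
  current = []

  text.split(' ').each do |word|
    current << word
    next unless word.end_with?('.')
    next if abbreviations.include?(word.chomp('.'))

    sentences << current.join(' ')
    current = []
  end

  sentences << current.join(' ') unless current.empty?
  sentences
end

p split_with_word_list('Dr. Bob is a dentist. Mr. Smith is not.')
# ["Dr. Bob is a dentist.", "Mr. Smith is not."]

# "Prof" is not in the list, so the title is wrongly split off:
p split_with_word_list('Prof. Lee spoke.')
# ["Prof.", "Lee spoke."]
```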
