How do I split a text or paragraph into sentences using the Stanford parser?
Is there any method that can extract sentences, such as the getSentencesFromString() method provided for Ruby?
You can check the DocumentPreprocessor class. Below is a short snippet. I think there may be other ways to do what you want.
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.SentenceUtils;
import edu.stanford.nlp.process.DocumentPreprocessor;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
List<String> sentenceList = new ArrayList<String>();

for (List<HasWord> sentence : dp) {
    // SentenceUtils not Sentence
    String sentenceString = SentenceUtils.listToString(sentence);
    sentenceList.add(sentenceString);
}

for (String sentence : sentenceList) {
    System.out.println(sentence);
}
I know there is already an accepted answer... but normally you would just grab the SentenceAnnotations from an annotated document.
// creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
Properties props = new Properties();
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

// read some text in the text variable
String text = ... // Add your text here!

// create an empty Annotation just with the given text
Annotation document = new Annotation(text);

// run all Annotators on this text
pipeline.annotate(document);

// these are all the sentences in this document
// a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
List<CoreMap> sentences = document.get(SentencesAnnotation.class);

for (CoreMap sentence : sentences) {
    // traversing the words in the current sentence
    // a CoreLabel is a CoreMap with additional token-specific methods
    for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
        // this is the text of the token
        String word = token.get(TextAnnotation.class);
        // this is the POS tag of the token
        String pos = token.get(PartOfSpeechAnnotation.class);
        // this is the NER label of the token
        String ne = token.get(NamedEntityTagAnnotation.class);
    }
}
Source: http://nlp.stanford.edu/software/corenlp.shtml (about halfway down the page)
If you're only looking for sentences, you can drop the later steps like "parse" and "dcoref" from the pipeline initialization; that will save you some loading and processing time. Rock and roll. ~K
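For illustration, a minimal sketch of such a sentences-only pipeline (the class name and sample text are made up, not from the answer):

import java.util.Properties;

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;

public class SentencesOnly {
    public static void main(String[] args) {
        // Only the annotators needed for sentence splitting; no parser or coref models are loaded.
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document = new Annotation("First sentence. And a second one.");
        pipeline.annotate(document);
        for (CoreMap sentence : document.get(SentencesAnnotation.class)) {
            System.out.println(sentence.get(TextAnnotation.class));
        }
    }
}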
The accepted answer has a couple of problems. First, the tokenizer transforms some characters, for example it turns the character “ into the two characters ``. Second, joining the tokenized text back together with whitespace does not give you the same result you started with. So the example from the accepted answer transforms the input text in non-trivial ways.
However, the CoreLabel class that the tokenizer uses keeps track of the source characters each token maps to, so rebuilding the proper string is trivial if you have the original.
Approach 1 below shows the approach from the accepted answer; approach 2 shows mine, which overcomes these problems.
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.ling.Sentence;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.WordToSentenceProcessor;
import edu.stanford.nlp.util.StringUtils;

String paragraph = "My 1st sentence. “Does it work for questions?” My third sentence.";
List<String> sentenceList;

/* ** APPROACH 1 (BAD!) ** */
Reader reader = new StringReader(paragraph);
DocumentPreprocessor dp = new DocumentPreprocessor(reader);
sentenceList = new ArrayList<String>();
for (List<HasWord> sentence : dp) {
    sentenceList.add(Sentence.listToString(sentence));
}
System.out.println(StringUtils.join(sentenceList, " _ "));

/* ** APPROACH 2 ** */
//// Tokenize
List<CoreLabel> tokens = new ArrayList<CoreLabel>();
PTBTokenizer<CoreLabel> tokenizer = new PTBTokenizer<CoreLabel>(
        new StringReader(paragraph), new CoreLabelTokenFactory(), "");
while (tokenizer.hasNext()) {
    tokens.add(tokenizer.next());
}

//// Split sentences from tokens
List<List<CoreLabel>> sentences = new WordToSentenceProcessor<CoreLabel>().process(tokens);

//// Join back together, using the token offsets into the original string
int end;
int start = 0;
sentenceList = new ArrayList<String>();
for (List<CoreLabel> sentence : sentences) {
    end = sentence.get(sentence.size() - 1).endPosition();
    sentenceList.add(paragraph.substring(start, end).trim());
    start = end;
}
System.out.println(StringUtils.join(sentenceList, " _ "));
This outputs:
My 1st sentence . _ `` Does it work for questions ? '' _ My third sentence .
My 1st sentence. _ “Does it work for questions?” _ My third sentence.
Using the .NET C# package: this will split sentences, get the brackets right, and preserve the original spaces and punctuation:
public class NlpDemo
{
    public static readonly TokenizerFactory TokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(),
        "normalizeParentheses=false,normalizeOtherBrackets=false,invertible=true");

    public void ParseFile(string fileName)
    {
        using (var stream = File.OpenRead(fileName))
        {
            SplitSentences(stream);
        }
    }

    public void SplitSentences(Stream stream)
    {
        var preProcessor = new DocumentPreprocessor(new UTF8Reader(new InputStreamWrapper(stream)));
        preProcessor.setTokenizerFactory(TokenizerFactory);

        foreach (java.util.List sentence in preProcessor)
        {
            ProcessSentence(sentence);
        }
    }

    // print the sentence with original spaces and punctuation.
    public void ProcessSentence(java.util.List sentence)
    {
        System.Console.WriteLine(edu.stanford.nlp.util.StringUtils.joinWithOriginalWhiteSpace(sentence));
    }
}
Input: - This sentence's characters possess a certain charm, one often found in punctuation and prose. Is this a second sentence? It really is.
Output: 3 sentences ("?" counts as an end-of-sentence delimiter)
Note: for a sentence like "Mrs. Havisham's class was impeccable (as far as one could see!) in all aspects." the tokenizer will correctly discern that the period at the end of Mrs. is not an EOS, but it will incorrectly mark the ! inside the parentheses as an EOS and split off "in all aspects." as a second sentence.
Use the Simple API shipped with Stanford CoreNLP version 3.6.0 or 3.7.0.
Here is an example for 3.6.0; it works exactly the same with 3.7.0.
Java code snippet:
import java.util.List;
import edu.stanford.nlp.simple.Document;
import edu.stanford.nlp.simple.Sentence;

public class TestSplitSentences {
    public static void main(String[] args) {
        Document doc = new Document("The text paragraph. Another sentence. Yet another sentence.");
        List<Sentence> sentences = doc.sentences();
        sentences.stream().forEach(System.out::println);
    }
}
Yields:
The text paragraph.
Another sentence.
Yet another sentence.
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>stanfordcorenlp</groupId>
    <artifactId>stanfordcorenlp</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
    </properties>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/edu.stanford.nlp/stanford-corenlp -->
        <dependency>
            <groupId>edu.stanford.nlp</groupId>
            <artifactId>stanford-corenlp</artifactId>
            <version>3.6.0</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/com.google.protobuf/protobuf-java -->
        <dependency>
            <groupId>com.google.protobuf</groupId>
            <artifactId>protobuf-java</artifactId>
            <version>2.6.1</version>
        </dependency>
    </dependencies>
</project>
You can use DocumentPreprocessor. It's really easy. Just give it a filename.
for (List<HasWord> sentence : new DocumentPreprocessor("pathto/filename.txt")) {
    // sentence is a list of the words in one sentence
}
You can quite easily use the Stanford tagger for this.
String text = "Your text...."; // your own text
List<List<HasWord>> tokenizedSentences = MaxentTagger.tokenizeText(new StringReader(text));

for (List<HasWord> act : tokenizedSentences) { // iterate through the sentences
    System.out.println(edu.stanford.nlp.ling.Sentence.listToString(act)); // this is your sentence
}
A variation on @Kevin's answer that solves the question is:
for (CoreMap sentence : sentences) {
    String sentenceText = sentence.get(TextAnnotation.class);
}
This gets you the sentence information without the overhead of the other annotators.
One element, addressed only in a few downvoted answers, is how to set the sentence delimiters. The most common way (the default) is to rely on the common punctuation marks that signal the end of a sentence. But corpora gathered from the wild come in other document formats, one of which is that each line is its own sentence.
To set your delimiters for the DocumentPreprocessor from the accepted answer, use setSentenceDelimiter(String). To use the pipeline approach suggested in @Kevin's answer, work with the ssplit properties. For example, for the end-of-line scheme proposed in the previous paragraph, set the property ssplit.eolonly to true, as sketched below.
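A minimal sketch of both options (the class name and sample text are made up; each option stands on its own):

import java.io.StringReader;
import java.util.Properties;

import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.process.DocumentPreprocessor;

public class DelimiterExamples {
    public static void main(String[] args) {
        // DocumentPreprocessor route: declare the newline to be the sentence delimiter.
        DocumentPreprocessor dp = new DocumentPreprocessor(
                new StringReader("one sentence per line\nanother sentence"));
        dp.setSentenceDelimiter("\n");

        // Pipeline route: tell the ssplit annotator to split on line breaks only.
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit");
        props.setProperty("ssplit.eolonly", "true");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}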
Add the paths of your input and output files in the code below:
import java.util.Properties;
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class NLPExample {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        BufferedReader br = new BufferedReader(new FileReader(
                "C:\\Users\\ACER\\Downloads\\stanford-corenlp-full-2018-02-27\\input.txt"));
        PrintWriter out = new PrintWriter(
                "C:\\Users\\ACER\\Downloads\\stanford-corenlp-full-2018-02-27\\output.txt");

        String readString;
        while ((readString = br.readLine()) != null) {
            out.println(readString);               // echo the input line
            Annotation annotation = new Annotation(readString);
            pipeline.annotate(annotation);         // run the annotators
            pipeline.prettyPrint(annotation, out); // write the annotated result
        }
        br.close();
        out.close();
        System.out.println("Done...");
    }
}
public class k {
    public static void main(String[] args) {
        String str = "This program splits a string based on space";
        String[] words = str.split(" ");
        for (String s : words) {
            System.out.println(s);
        }
        // "\\s+" splits on runs of any whitespace, not just single spaces.
        words = str.split("\\s+");
    }
}
Split the text into sentences using a regular expression. Here it is in C# (I don't know the Java equivalent, though):
Code:
string[] sentences = Regex.Split(text, @"(?<=['""A-Za-z][\)][\.\!\?])\s+(?=[A-Z])");
It works about 90% of the time.
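Since the question is about Java: Java's regex engine supports the same lookbehind/lookahead constructs, so a direct port of that pattern might look like the sketch below (same pattern, same caveat about the missed 10%):

// Direct port of the C# pattern above; String.split drops the matched whitespace.
String[] sentences = text.split("(?<=['\"A-Za-z][\\)][\\.\\!\\?])\\s+(?=[A-Z])");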