**Barack Hussein Obama** II is the 44th and current President of the United States, and the first African American to hold the office.
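I don't want to use the OpenNLP API because it may not recognize all named entities. Is there a way to generate such n-grams using another service, or perhaps a way to group all the noun terms together?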
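If you want to avoid NER, you can use a sentence chunker or a parser, which will extract noun phrases generically. OpenNLP has both a sentence chunker and a parser, but if you are opposed to using OpenNLP for some reason, you can try others. If you are interested in the OpenNLP chunker, I'll post some code that uses it to extract noun phrases.
Here is the code. You will need to download the models from SourceForge first.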
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/**
 * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
 * the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {

    //size of the n-grams to build (2 = bigrams)
    static final int N = 2;

    public static void main(String[] args) {
        String modelPath = "c:\\temp\\opennlpmodels\\";
        //try-with-resources closes the model streams once the models are loaded
        try (InputStream tokenIn = new FileInputStream(new File(modelPath + "en-token.zip"));
             InputStream posIn = new FileInputStream(new File(modelPath + "en-pos-maxent.zip"));
             InputStream chunkerIn = new FileInputStream(new File(modelPath + "en-chunker.zip"))) {
            TokenizerME wordBreaker = new TokenizerME(new TokenizerModel(tokenIn));
            POSTaggerME posme = new POSTaggerME(new POSModel(posIn));
            ChunkerME chunkerME = new ChunkerME(new ChunkerModel(chunkerIn));

            //this is your sentence
            String sentence = "Barack Hussein Obama II is the 44th and current President of the United States, and the first African American to hold the office.";
            //words is the tokenized sentence
            String[] words = wordBreaker.tokenize(sentence);
            //posTags are the parts of speech of every word in the sentence (the chunker needs this info)
            String[] posTags = posme.tag(words);
            //chunks are the start/end "span" indices of the chunks in the words array
            Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
            //chunkStrings are the actual chunks
            String[] chunkStrings = Span.spansToStrings(chunks, words);

            for (int i = 0; i < chunks.length; i++) {
                //keep only the noun phrase chunks
                if (chunks[i].getType().equals("NP")) {
                    System.out.println("NP: \n\t" + chunkStrings[i]);
                    String[] split = chunkStrings[i].split(" ");
                    List<String> ngrams = ngram(Arrays.asList(split), N, " ");
                    for (String gram : ngrams) {
                        System.out.println("\t" + gram);
                    }
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static List<String> ngram(List<String> input, int n, String separator) {
        //a phrase with n or fewer tokens is returned unchanged, one token per entry
        if (input.size() <= n) {
            return input;
        }
        List<String> outGrams = new ArrayList<String>();
        //slide a window of n tokens across the phrase
        for (int i = 0; i + n <= input.size(); i++) {
            outGrams.add(String.join(separator, input.subList(i, i + n)));
        }
        return outGrams;
    }
}
The output I get with your sentence is this (with N set to 2, i.e. bigrams):
Barack Hussein Obama II
Barack Hussein
Hussein Obama
Obama II
the 44th and current President
the 44th
44th and
and current
current President
the United States
the United
United States
the first African American
the first
first African
African American
the office
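This doesn't explicitly deal with the case of adjectives that fall outside the NP. If that happens, you can get this information from the POS tags and integrate it. What I gave you should send you in the right direction.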
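To make the POS-tag idea a bit more concrete, here is a minimal sketch of one way it could be done: widen a noun phrase to the left while the token just before it is tagged as an adjective. The class and method names are hypothetical and not part of OpenNLP, the tokens, tags and chunk boundaries in `main` are hand-written for illustration rather than real chunker output, and the check assumes Penn Treebank adjective tags (`JJ`, `JJR`, `JJS`).

```java
import java.util.Arrays;

/** Hypothetical helper (not part of OpenNLP): widens an NP chunk using POS tags. */
final class AdjectiveAwareNounPhrases {

    /**
     * Moves a chunk's start index to the left while the token just before it
     * carries an adjective tag. Assumes Penn Treebank tags (JJ, JJR, JJS).
     */
    static int extendStartOverAdjectives(String[] posTags, int start) {
        int newStart = start;
        while (newStart > 0 && posTags[newStart - 1].startsWith("JJ")) {
            newStart--;
        }
        return newStart;
    }

    /** Joins words[start, end) with single spaces; end is exclusive. */
    static String join(String[] words, int start, int end) {
        return String.join(" ", Arrays.copyOfRange(words, start, end));
    }

    public static void main(String[] args) {
        // Made-up input that mimics a chunker leaving an adjective out of the NP.
        String[] words = {"They", "visited", "old", "stone", "bridges"};
        String[] posTags = {"PRP", "VBD", "JJ", "NN", "NNS"};
        int npStart = 3; // assumed NP chunk: "stone bridges"
        int npEnd = 5;   // exclusive end index

        int widenedStart = extendStartOverAdjectives(posTags, npStart);
        System.out.println(join(words, widenedStart, npEnd)); // prints "old stone bridges"
    }
}
```

In the program above, the start and end indices would come from `chunks[i].getStart()` and `chunks[i].getEnd()`, and the tags from `posTags`. Whether to absorb an adjective that already belongs to a neighbouring chunk is a judgment call that depends on your data.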