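Basically, you first need to split the block of text into sentences. This is tricky, even in English, because you have to watch for periods, question marks, exclamation marks and any other sentence terminators.
Then process one sentence at a time, after removing all punctuation (commas, semicolons, colons and so on).
Then, once you're left with an array of words, it gets simpler: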
    for i = 1 to num_words-1:
        for j = i+1 to num_words:
            phrase = words[i through j inclusive]
            store phrase
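That's it, very simple (after the initial massaging of the text block, which may not be as simple as you think).
This will give you all phrases of two or more words from each sentence.
The sentence splitting, word splitting, punctuation removal and so on will be the hardest part, but I've shown you some simple initial rules. The rest should be added each time a block of text breaks the algorithm.
Update:
As requested, here's some Java code that gives the phrases: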
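To make those initial rules concrete, here's a minimal sketch of the pre-processing step. It assumes a sentence ends at a '.', '?' or '!' followed by whitespace, and that stripping every ASCII punctuation character is acceptable (so "don't" becomes "dont"). The class and method names (`SentencePrep`, `splitSentences`, `toWords`) are made up for illustration. Abbreviations such as "Mr. Smith" will break it, which is exactly the kind of case you'd add a rule for later; `java.text.BreakIterator.getSentenceInstance` is the standard-library alternative if you'd rather not roll your own.

```java
import java.util.ArrayList;
import java.util.List;

class SentencePrep {
    // Split a block into sentences wherever '.', '?' or '!' is
    // followed by whitespace (assumed rule, deliberately naive).
    static List<String> splitSentences(String block) {
        List<String> sentences = new ArrayList<>();
        for (String s : block.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    // Remove ASCII punctuation, then split what is left on whitespace.
    static String[] toWords(String sentence) {
        String cleaned = sentence.replaceAll("\\p{Punct}", "").trim();
        return cleaned.isEmpty() ? new String[0] : cleaned.split("\\s+");
    }

    public static void main(String[] args) {
        String block =
            "My username is click upvote. I have 4k rep on stackoverflow.";
        for (String sentence : splitSentences(block)) {
            // Show the words of each sentence separated by '|'.
            System.out.println(String.join("|", toWords(sentence)));
        }
    }
}
```

The word array returned by `toWords` is what the phrase loops would then work on.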
    public class testme {
        public final static String text =
            "My username is click upvote." +
            " I have 4k rep on stackoverflow.";

        public static void procSentence (String sent) {
            System.out.println ("==========");
            System.out.println ("sentence [" + sent + "]");

            // Split sentence at whitespace into array.
            String [] sa = sent.split("\\s+");

            // Process each starting word.
            for (int i = 0; i < sa.length - 1; i++) {
                // Process each phrase.
                for (int j = i+1; j < sa.length; j++) {
                    // Build the phrase.
                    String phrase = sa[i];
                    for (int k = i+1; k <= j; k++) {
                        phrase = phrase + " " + sa[k];
                    }

                    // This is where you have your phrase. I just
                    // print it out but you can do whatever you
                    // wish with it.
                    System.out.println (" " + phrase);
                }
            }
        }

        public static void main(String[] args) {
            // This is the block of text to process.
            String block = text;
            System.out.println ("block [" + block + "]");

            // Keep going until no more sentences.
            while (!block.equals("")) {
                // Remove leading spaces.
                if (block.startsWith(" ")) {
                    block = block.substring(1);
                    continue;
                }

                // Find end of sentence. If no period is left, pos is -1
                // and the rest of the block is taken as the last sentence.
                int pos = block.indexOf('.');

                // Extract sentence and remove it from text block.
                String sentence = (pos < 0) ? block : block.substring(0,pos);
                block = (pos < 0) ? "" : block.substring(pos+1);

                // Process the sentence (this is the "meat").
                procSentence (sentence);
                System.out.println ("block [" + block + "]");
            }
            System.out.println ("==========");
        }
    }
Output:
    block [My username is click upvote. I have 4k rep on stackoverflow.]
    ==========
    sentence [My username is click upvote]
     My username
     My username is
     My username is click
     My username is click upvote
     username is
     username is click
     username is click upvote
     is click
     is click upvote
     click upvote
    block [ I have 4k rep on stackoverflow.]
    ==========
    sentence [I have 4k rep on stackoverflow]
     I have
     I have 4k
     I have 4k rep
     I have 4k rep on
     I have 4k rep on stackoverflow
     have 4k
     have 4k rep
     have 4k rep on
     have 4k rep on stackoverflow
     4k rep
     4k rep on
     4k rep on stackoverflow
     rep on
     rep on stackoverflow
     on stackoverflow
    block []
    ==========
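Now, keep in mind that this is pretty basic Java (some might say it's C written in a Java dialect :-). It's only meant to illustrate how to output word groupings from a sentence, as you asked.
It doesn't do all the fancy sentence detection and punctuation removal that I mentioned in the original answer.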
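If you'd rather collect the phrases than print them, something along these lines might do. It's a rough sketch rather than a drop-in replacement: the class and method names (`PhraseList`, `phrases`) are hypothetical, and it assumes the sentence has already been stripped of punctuation. It produces the phrases in the same order as the nested loops above, but grows each phrase with a `StringBuilder` as `j` advances instead of rebuilding it in a third loop.

```java
import java.util.ArrayList;
import java.util.List;

class PhraseList {
    // Return every run of two or more consecutive words in the sentence.
    static List<String> phrases(String sentence) {
        List<String> result = new ArrayList<>();
        String trimmed = sentence.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        String[] words = trimmed.split("\\s+");

        // Each word except the last can start a phrase.
        for (int i = 0; i < words.length - 1; i++) {
            // Grow the phrase one word at a time and record each stage.
            StringBuilder phrase = new StringBuilder(words[i]);
            for (int j = i + 1; j < words.length; j++) {
                phrase.append(' ').append(words[j]);
                result.add(phrase.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String p : phrases("My username is click upvote")) {
            System.out.println(p);
        }
    }
}
```

A sentence of n words yields n(n-1)/2 phrases, so the five-word sentence gives 10 and the six-word one gives 15, matching the output listed earlier.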