java - How to find frequently occurring phrases in a text document

Question

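I have a text document containing multiple paragraphs. I need to find the phrases that frequently occur together.

For example:

Patient name xyz phone no 12345 emailid xyz@abc.com. Patient name abc address some us address

Comparing these lines, the common phrase is "Patient name". The phrase can appear anywhere in a paragraph. My requirement is to use NLP to find the most frequently occurring phrases in the document, regardless of their position.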

score 0 · Accepted Answer

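You should use n-grams for this: you simply count how many times each sequence of n consecutive words appears. Because you don't know in advance how many words will repeat, you can try several values of n, e.g. from 2 to 6.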

Java n-grams example, tested on JDK 1.8.0:

import java.util.*;

public class NGramExample {

    // Counts how often each sequence of n consecutive words occurs in the text.
    public static HashMap<String, Integer> ngrams(String text, int n) {
        // Naive tokenization: split on single spaces, so punctuation stays attached to words.
        ArrayList<String> words = new ArrayList<String>();
        for (String word : text.split(" ")) {
            words.add(word);
        }

        HashMap<String, Integer> map = new HashMap<String, Integer>();

        int c = words.size();
        for (int i = 0; i < c; i++) {
            // Only build an n-gram if n words remain starting at position i.
            if ((i + n - 1) < c) {
                int stop = i + n;
                String ngramWords = words.get(i);

                for (int j = i + 1; j < stop; j++) {
                    ngramWords += " " + words.get(j);
                }
                // Insert with count 1, or add 1 to the existing count.
                map.merge(ngramWords, 1, Integer::sum);
            }
        }

        return map;
    }

    public static void main(String[] args) {
        System.out.println("Ngrams: ");
        HashMap<String, Integer> res = ngrams("Patient name xyz phone no 12345 emailid xyz@abc.com. Patient name abc address some us address", 2);
        // HashMap iteration order is unspecified, so the lines are not printed in text order.
        for (Map.Entry<String, Integer> entry : res.entrySet()) {
            System.out.println(entry.getKey() + ":" + entry.getValue().toString());
        }
    }
}

Output:

Ngrams: 
name abc:1
xyz@abc.com. Patient:1
emailid xyz@abc.com.:1
phone no:1
12345 emailid:1
Patient name:2
xyz phone:1
address some:1
us address:1
name xyz:1
some us:1
no 12345:1
abc address:1

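So you can see that "Patient name" has the highest count, 2. You can call this function with several values of n and pick the n-grams with the highest number of occurrences.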
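Trying several values of n could look roughly like the sketch below. The class name `FrequentPhrases` and the helpers `countNgrams` and `mostFrequentPerN` are made up for illustration, and the "must occur at least twice" cutoff is an assumption about what counts as a frequent phrase. Ties are broken arbitrarily, because `HashMap` iteration order is unspecified.

```java
import java.util.*;

public class FrequentPhrases {

    // Counts every run of n consecutive words.
    public static Map<String, Integer> countNgrams(String[] words, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= words.length; i++) {
            String phrase = String.join(" ", Arrays.copyOfRange(words, i, i + n));
            counts.merge(phrase, 1, Integer::sum);
        }
        return counts;
    }

    // For each n in [minN, maxN], keeps the most frequent n-gram,
    // but only if it occurs at least twice (assumed cutoff).
    public static Map<Integer, Map.Entry<String, Integer>> mostFrequentPerN(
            String text, int minN, int maxN) {
        String[] words = text.split(" ");
        Map<Integer, Map.Entry<String, Integer>> best = new TreeMap<>();
        for (int n = minN; n <= maxN; n++) {
            String topPhrase = null;
            int topCount = 1;
            for (Map.Entry<String, Integer> e : countNgrams(words, n).entrySet()) {
                if (e.getValue() > topCount) {
                    topPhrase = e.getKey();
                    topCount = e.getValue();
                }
            }
            if (topPhrase != null) {
                best.put(n, new AbstractMap.SimpleImmutableEntry<>(topPhrase, topCount));
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String text = "Patient name xyz phone no 12345 emailid xyz@abc.com. "
                + "Patient name abc address some us address";
        mostFrequentPerN(text, 2, 6).forEach((n, e) ->
                System.out.println("n=" + n + ": " + e.getKey() + " (" + e.getValue() + ")"));
    }
}
```

For the sample text only the bigram "Patient name" repeats, so the larger values of n produce no entry at all.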

Edit: for historical reasons, I'll leave this Python code here.

A simple working Python example (using nltk) to show you what I mean:

from nltk import ngrams
from collections import Counter

paragraph = 'Patient name xyz phone no 12345 emailid xyz@abc.com. Patient name abc address some us address'
n = 2
words = paragraph.split(' ') # of course you should split sentences in a better way
bigrams = ngrams(words, n)
c = Counter(bigrams)
c.most_common()[0]

This gives you the output:

>> (('Patient', 'name'), 2)
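The comment about splitting in a better way applies to the Java version as well: splitting on single spaces leaves "xyz@abc.com." with its trailing period and produces the bigram "xyz@abc.com. Patient", which spans two sentences. A minimal sketch of a more careful tokenizer follows. The class name `SentenceAwareNgrams` is hypothetical, and the rules it applies (a sentence ends at `.`, `!` or `?` followed by whitespace, words are lowercased, leading and trailing punctuation is stripped) are assumptions that real text may need more than.

```java
import java.util.*;

public class SentenceAwareNgrams {

    // Splits text into sentences, then each sentence into cleaned, lowercase words.
    public static List<List<String>> tokenize(String text) {
        List<List<String>> sentences = new ArrayList<>();
        // Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            List<String> words = new ArrayList<>();
            for (String token : sentence.trim().split("\\s+")) {
                // Strip punctuation at both ends but keep inner characters (e.g. xyz@abc.com).
                String word = token.toLowerCase(Locale.ROOT)
                        .replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
            if (!words.isEmpty()) {
                sentences.add(words);
            }
        }
        return sentences;
    }

    // Counts n-grams within each sentence, never across a sentence boundary.
    public static Map<String, Integer> ngrams(String text, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> words : tokenize(text)) {
            for (int i = 0; i + n <= words.size(); i++) {
                counts.merge(String.join(" ", words.subList(i, i + n)), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "Patient name xyz phone no 12345 emailid xyz@abc.com. "
                + "Patient name abc address some us address";
        System.out.println(ngrams(text, 2).get("patient name"));
    }
}
```

Lowercasing merges "Patient name" and "patient name" into one phrase, which is usually what you want when counting, but it is a design choice rather than a requirement.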
