
I am using LingPipe for sentence detection, but I don't know whether there is a better tool. As far as I know, there is no way to compare two sentences and check whether they mean the same thing.

Is there any other good resource that gives me a pre-built way to compare two sentences and check whether they are similar?

My requirement is as follows:

String sent1 = "Mary and Meera are my classmates.";

String sent2 = "Meera and Mary are my classmates.";

String sent3 = "I am in Meera and Mary's class.";

// Several sentences will be formed; basically what I need
// to do is this:

boolean bothAreEqual = compareOf(sent1, sent2);

sop(bothAreEqual); // should print true

bothAreEqual = compareOf(sent2, sent3);

sop(bothAreEqual); // should print true
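For the word-reordering case (sent1 vs. sent2), a minimal bag-of-words stand-in for compareOf can work; this is only a sketch (the class name, helper, and regex-based tokenization are illustrative), and it will not recognize genuine paraphrases such as sent3:

```java
import java.util.Arrays;

public class CompareSketch {
    // Hypothetical compareOf: treats each sentence as a sorted bag of
    // lower-cased words, so reordering ("Mary and Meera" vs. "Meera and
    // Mary") compares equal. It will NOT recognize real paraphrases like
    // sent3; that needs semantic NLP, not string processing.
    static boolean compareOf(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    private static String normalize(String s) {
        // Lower-case, strip everything but letters and spaces, then sort the words
        String[] words = s.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
        Arrays.sort(words);
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        String sent1 = "Mary and Meera are my classmates.";
        String sent2 = "Meera and Mary are my classmates.";
        System.out.println(compareOf(sent1, sent2)); // prints true
    }
}
```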

2 Answers


How to test whether two sentences have the same meaning: that would be far too open-ended a question.

However, there are ways to compare two sentences and see whether they are similar. There are many possible definitions of similarity, and they can be tested with pre-built methods.
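One such definition, for illustration, is Jaccard similarity over the sets of words in each sentence; a minimal sketch (class and method names are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class JaccardSketch {
    // Jaccard similarity over word sets: |A ∩ B| / |A ∪ B|.
    // Word order is ignored, so reordered sentences score 1.0.
    static double jaccard(String a, String b) {
        Set<String> setA = new HashSet<String>(Arrays.asList(a.toLowerCase().split("\\W+")));
        Set<String> setB = new HashSet<String>(Arrays.asList(b.toLowerCase().split("\\W+")));
        Set<String> intersection = new HashSet<String>(setA);
        intersection.retainAll(setB);
        Set<String> union = new HashSet<String>(setA);
        union.addAll(setB);
        return union.isEmpty() ? 1.0 : (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        System.out.println(jaccard("Mary and Meera are my classmates.",
                "Meera and Mary are my classmates.")); // prints 1.0
    }
}
```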

See, for example, http://en.wikipedia.org/wiki/Levenshtein_distance

Distance between 
'Mary and Meera are my classmates.' 
and 'Meera and Mary are my classmates.': 
6
Distance between 
'Mary and Meera are my classmates.' 
and 'Alice and Bobe are not my classmates.': 
14
Distance between 
'Mary and Meera are my classmates.' 
and 'Some totally different sentence.': 
29
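Since a raw edit distance grows with sentence length, it is common to normalize it by the longer string's length to get a similarity score in [0, 1]; a sketch (the class and helper names are illustrative, and the distance 6 is taken from the output above):

```java
public class SimilarityRatio {
    // Turn a raw Levenshtein distance into a similarity score in [0, 1]
    // by normalizing with the longer string's length.
    static double similarity(String a, String b, int distance) {
        int maxLen = Math.max(a.length(), b.length());
        return maxLen == 0 ? 1.0 : 1.0 - (double) distance / maxLen;
    }

    public static void main(String[] args) {
        // Distance 6 between the two 33-character sentences above
        System.out.println(similarity("Mary and Meera are my classmates.",
                "Meera and Mary are my classmates.", 6)); // ~0.818
    }
}
```

One would then pick an application-specific threshold (say, 0.8) above which two sentences are treated as "the same".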

Code:

public class LevenshteinDistance {

    private static int minimum(int a, int b, int c) {
        return Math.min(Math.min(a, b), c);
    }

    public static int computeDistance(CharSequence str1,
            CharSequence str2) {

        int[][] distance = new int[str1.length() + 1][str2.length() + 1];

        for (int i = 0; i <= str1.length(); i++){
            distance[i][0] = i;
        }
        for (int j = 0; j <= str2.length(); j++){
            distance[0][j] = j;
        }
        for (int i = 1; i <= str1.length(); i++){
            for (int j = 1; j <= str2.length(); j++){
                distance[i][j] = minimum(
                    distance[i - 1][j] + 1,
                    distance[i][j - 1] + 1,
                    distance[i - 1][j - 1]
                        + ((str1.charAt(i - 1) == str2.charAt(j - 1)) ? 0 : 1));
            }
        }
        int result = distance[str1.length()][str2.length()];
        //log.debug("distance:"+result);
        return result;
    }


    public static void main(String[] args) {
        String sent1="Mary and Meera are my classmates.";
        String sent2="Meera and Mary are my classmates.";       
        String sent3="Alice and Bobe are not my classmates.";
        String sent4="Some totally different sentence.";

        System.out.println("Distance between \n'" + sent1 + "' \nand '" + sent2 + "': \n" + computeDistance(sent1, sent2));
        System.out.println("Distance between \n'" + sent1 + "' \nand '" + sent3 + "': \n" + computeDistance(sent1, sent3));
        System.out.println("Distance between \n'" + sent1 + "' \nand '" + sent4 + "': \n" + computeDistance(sent1, sent4));
    }
}
answered 2012-08-21T11:17:05.993

Here is what I came up with. It is only a stand-in until I get the real thing, but it may be of some help to someone out there.

package com.examples;

import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import com.aliasi.sentences.MedlineSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.Files;

public class SentenceWordAnalysisAndLevenshteinDistance {
private static int minimum(int a, int b, int c) {
    return Math.min(Math.min(a, b), c);
}

public static int computeDistance(CharSequence str1, CharSequence str2) {
    int[][] distance = new int[str1.length() + 1][str2.length() + 1];
    for (int i = 0; i <= str1.length(); i++) {
        distance[i][0] = i;
    }
    for (int j = 0; j <= str2.length(); j++) {
        distance[0][j] = j;
    }
    for (int i = 1; i <= str1.length(); i++) {
        for (int j = 1; j <= str2.length(); j++) {
            distance[i][j] = minimum(
                    distance[i - 1][j] + 1,
                    distance[i][j - 1] + 1,
                    distance[i - 1][j - 1]
                            + ((str1.charAt(i - 1) == str2.charAt(j - 1)) ? 0
                                    : 1));
        }
    }
    int result = distance[str1.length()][str2.length()];

    return result;
}

static final TokenizerFactory TOKENIZER_FACTORY = IndoEuropeanTokenizerFactory.INSTANCE;
static final SentenceModel SENTENCE_MODEL = new MedlineSentenceModel();

public static void main(String[] args) {
    try {
        ArrayList<String> sentences = new ArrayList<String>();
        // Reading from text file
        // sentences = readSentencesInFile("D:\\sam.txt");

        // Giving sentences
        // ArrayList<String> sentences = new ArrayList<String>();
        sentences.add("Mary and Meera are my classmates.");
        sentences.add("Mary and Meera are my classmates.");
        sentences.add("Meera and Mary are my classmates.");
        sentences.add("Alice and Bobe are not my classmates.");
        sentences.add("Some totally different sentence.");
        // Self-implemented
        wordAnalyser(sentences);
        // Internet referred
        // levenshteinDistance(sentences);
    } catch (Exception e) {
        e.printStackTrace();
    }
}

private static ArrayList<String> readSentencesInFile(String path) {
    ArrayList<String> sentencesList = new ArrayList<String>();

    try {
        System.out.println("Reading file from : " + path);
        File file = new File(path);
        String text = Files.readFromFile(file, "ISO-8859-1");
        System.out.println("INPUT TEXT: ");
        System.out.println(text);

        List<String> tokenList = new ArrayList<String>();
        List<String> whiteList = new ArrayList<String>();
        Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(
                text.toCharArray(), 0, text.length());
        tokenizer.tokenize(tokenList, whiteList);

        System.out.println(tokenList.size() + " TOKENS");
        System.out.println(whiteList.size() + " WHITESPACES");

        String[] tokens = new String[tokenList.size()];
        String[] whites = new String[whiteList.size()];
        tokenList.toArray(tokens);
        whiteList.toArray(whites);
        int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens,
                whites);

        System.out.println(sentenceBoundaries.length
                + " SENTENCE END TOKEN OFFSETS");

        if (sentenceBoundaries.length < 1) {
            System.out.println("No sentence boundaries found.");
            return new ArrayList<String>();
        }
        int sentStartTok = 0;
        int sentEndTok = 0;
        for (int i = 0; i < sentenceBoundaries.length; ++i) {
            sentEndTok = sentenceBoundaries[i];
            System.out.println("SENTENCE " + (i + 1) + ": ");
            StringBuffer sentenceString = new StringBuffer();
            for (int j = sentStartTok; j <= sentEndTok; j++) {
                sentenceString.append(tokens[j] + whites[j + 1]);
            }
            System.out.println(sentenceString.toString());
            sentencesList.add(sentenceString.toString());
            sentStartTok = sentEndTok + 1;
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    return sentencesList;
}

private static void levenshteinDistance(ArrayList<String> sentences) {
    System.out.println("\nLevenshteinDistance");
    for (int i = 0; i < sentences.size(); i++) {
        System.out.println("Distance between \n'" + sentences.get(0)
                + "' \nand '" + sentences.get(i) + "': \n"
                + computeDistance(sentences.get(0), sentences.get(i)));
    }
}

private static void wordAnalyser(ArrayList<String> sentences) {

    System.out.println("No.of Sentences : " + sentences.size());
    List<String> stopWordsList = getStopWords();
    List<String> tokenList = new ArrayList<String>();
    ArrayList<List<String>> filteredSentences = new ArrayList<List<String>>();

    for (int i = 0; i < sentences.size(); i++) {
        tokenList = new ArrayList<String>();
        List<String> whiteList = new ArrayList<String>();
        Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(sentences.get(i)
                .toCharArray(), 0, sentences.get(i).length());
        tokenizer.tokenize(tokenList, whiteList);

        System.out.print("Sentence " + (i + 1) + ": " + tokenList.size()
                + " TOKENS, ");
        System.out.println(whiteList.size() + " WHITESPACES");

        filteredSentences.add(filterStopWords(tokenList, stopWordsList));
    }

    for (int i = 0; i < sentences.size(); i++) {
        System.out.println("\n" + (i + 1) + ". Comparing\n      '"
                + sentences.get(0) + "' \nwith\n     '" + sentences.get(i)
                + "' : \n");
        System.out.println(filteredSentences.get(0) + "\n and \n"
                + filteredSentences.get(i));
        System.out.println("Percentage of similarity: "
                + calculateSimilarity(filteredSentences.get(0),
                        filteredSentences.get(i)) + "%");
    }
}

private static double calculateSimilarity(List<String> list1,
        List<String> list2) {
    int length1 = list1.size();
    int length2 = list2.size();

    int count1 = 0;
    int count2 = 0;
    double result1 = 0.0;
    double result2 = 0.0;
    // computing result1 (multiply by 100.0 to avoid integer division)
    for (String string1 : list1) {
        if (list2.contains(string1))
            count1++;
    }
    result1 = (count1 * 100.0) / length1;
    // computing result2
    for (String string2 : list2) {
        if (list1.contains(string2))
            count2++;
    }
    result2 = (count2 * 100.0) / length2;

    double avg = (result1 + result2) / 2;

    return avg;
}

private static List<String> getStopWords() {
    String stopWordsString = ".,a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your";
    List<String> stopWordsList = new ArrayList<String>();
    List<String> stopWordTokenList = new ArrayList<String>();
    List<String> whiteList = new ArrayList<String>();
    Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(
            stopWordsString.toCharArray(), 0, stopWordsString.length());
    tokenizer.tokenize(stopWordTokenList, whiteList);
    for (int i = 0; i < stopWordTokenList.size(); i++) {
        // System.out.println((i + 1) + ":" + tokenList.get(i));
        if (!stopWordTokenList.get(i).equals(",")) {
            stopWordsList.add(stopWordTokenList.get(i));
        }
    }
    System.out.println("No.of stop words: " + stopWordsList.size());
    return stopWordsList;
}

private static List<String> filterStopWords(List<String> tokenList,
        List<String> stopWordsList) {

    List<String> filteredSentenceWords = new ArrayList<String>();
    for (String sentenceToken : tokenList) {
        if (!stopWordsList.contains(sentenceToken)) {
            filteredSentenceWords.add(sentenceToken);
        }
    }
    return filteredSentenceWords;
}
}
answered 2012-08-24T13:16:41.257