
I want to develop a sentence-matching engine. Given an input sentence, the engine should return the best-matching sentence from a data set (a text file containing many sentences). Even if the best match is a poor one, the engine must still return a sentence from the data set.

Example input:
Hello I am Nidhin Joseph
Data set:
1). Hello, How are you?
2). And I am Nidhin.
3). I am Nidhin Joseph Hello.
Among these three, the best match for my requirements is the third sentence. I rank candidates on the basis of both word hits and word order.
My input: {"Hello","I","am","Nidhin","Joseph"}
My output: {"I","am","Nidhin","Joseph","Hello"}

Here, number of word hits = 4
Number of relatively ordered words = 4
I am not sure whether I have conveyed my idea clearly. If I have, please tell me whether a similar library is already available in Java. If not, please point me in the right direction so that I can develop it more easily.


1 Answer


I suggest the Levenshtein distance algorithm. You could use the standard algorithm on the entire sentence, treating it as a long string of characters (including the blanks and punctuation).
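
As a rough illustration of the character-level approach, here is a minimal sketch in Java. The class name `CharLevenshteinMatcher` and the method names `distance` and `bestMatch` are my own, not from any library, and the code assumes the data set has already been read into a list of strings. It always returns some sentence from a non-empty list, however poor the match.

```java
import java.util.List;

// Hypothetical helper class, written only for illustration.
final class CharLevenshteinMatcher {

    // Standard dynamic-programming Levenshtein distance over characters.
    // Only two rows of the table are kept in memory.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete
                        prev[j - 1] + cost);                    // match or substitute
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the sentence with the smallest distance to the input.
    // Ties go to the earlier sentence; returns null only for an empty list.
    static String bestMatch(String input, List<String> sentences) {
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String s : sentences) {
            int d = distance(input, s);
            if (d < bestDistance) {
                bestDistance = d;
                best = s;
            }
        }
        return best;
    }
}
```

Each comparison costs time proportional to the product of the two sentence lengths, so for a very large data set you may want to pre-filter candidates first.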

Depending on your requirements you could try some variations like running the Porter stemmer on all the words or ignoring the punctuation. You could even modify the Levenshtein algorithm to use words as its atoms instead of characters.
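
The word-as-atom variation could look roughly like the sketch below. The names `WordLevenshteinMatcher`, `tokenize`, `distance`, and `bestMatch` are hypothetical, and the normalization (lower-casing and dropping punctuation) is just one assumption about what "ignoring the punctuation" might mean. Stemming is left out; a Porter stemmer could be applied to each word inside `tokenize`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, written only for illustration.
final class WordLevenshteinMatcher {

    // Lower-cases the sentence and splits it on anything that is not a
    // letter or digit, so punctuation and blanks are ignored.
    static List<String> tokenize(String sentence) {
        List<String> words = new ArrayList<>();
        for (String w : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                words.add(w); // a stemmer could be applied to w here
            }
        }
        return words;
    }

    // Levenshtein distance where the atoms are whole words, not characters.
    static int distance(List<String> a, List<String> b) {
        int[] prev = new int[b.size() + 1];
        int[] curr = new int[b.size() + 1];
        for (int j = 0; j <= b.size(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.size(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.size(); j++) {
                int cost = a.get(i - 1).equals(b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1), // insert, delete a word
                        prev[j - 1] + cost);                    // match or substitute a word
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.size()];
    }

    // Returns the sentence whose word sequence is closest to the input.
    // Ties go to the earlier sentence; returns null only for an empty list.
    static String bestMatch(String input, List<String> sentences) {
        List<String> inputWords = tokenize(input);
        String best = null;
        int bestDistance = Integer.MAX_VALUE;
        for (String s : sentences) {
            int d = distance(inputWords, tokenize(s));
            if (d < bestDistance) {
                bestDistance = d;
                best = s;
            }
        }
        return best;
    }
}
```

On the example from the question, this word-level distance is 4 for "Hello, How are you?" and 2 for both "And I am Nidhin." and "I am Nidhin Joseph Hello.", so the second and third sentences tie. To get the ranking asked for, you would probably need a tie-breaker such as the number of word hits.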

Answered 2013-11-01T05:55:49.810