
Is there any way to tell whether two strings are similar in meaning, even when the words in them differ?

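So far I have tried fuzzywuzzy, Levenshtein distance, and cosine similarity to match the strings, but all of them compare the words themselves rather than what the words mean.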

from fuzzywuzzy import fuzz

# levenshtein_ratio_and_distance is a helper defined elsewhere (not shown here)

Str1 = "what are types of negotiation"
Str2 = "what are advantages of negotiation"
Str3 = "what are categories of negotiation"

# Str1 vs Str2 (different meaning)
Ratio = fuzz.ratio(Str1.lower(),Str2.lower())
Partial_Ratio = fuzz.partial_ratio(Str1.lower(),Str2.lower())
Token_Sort_Ratio = fuzz.token_sort_ratio(Str1,Str2)

# Str1 vs Str3 (same meaning)
Ratio1 = fuzz.ratio(Str1.lower(),Str3.lower())
Partial_Ratio1 = fuzz.partial_ratio(Str1.lower(),Str3.lower())
Token_Sort_Ratio1 = fuzz.token_sort_ratio(Str1,Str3)

print("fuzzywuzzy")
print(Str1," ",Str2," ",Ratio)
print(Str1," ",Str2," ",Partial_Ratio)
print(Str1," ",Str2," ",Token_Sort_Ratio)
print(Str1," ",Str3," ",Ratio1)
print(Str1," ",Str3," ",Partial_Ratio1)
print(Str1," ",Str3," ",Token_Sort_Ratio1)

print("levenshtein ratio")
Ratio = levenshtein_ratio_and_distance(Str1,Str2,ratio_calc = True)
Ratio1 = levenshtein_ratio_and_distance(Str1,Str3,ratio_calc = True)
print(Str1," ",Str2," ",Ratio)
# Fixed: this line previously printed Ratio instead of Ratio1
print(Str1," ",Str3," ",Ratio1)

output:
fuzzywuzzy
what are types of negotiation   what are advantages of negotiation   86
what are types of negotiation   what are advantages of negotiation   76
what are types of negotiation   what are advantages of negotiation   73
what are types of negotiation   what are categories of negotiation   86
what are types of negotiation   what are categories of negotiation   76
what are types of negotiation   what are categories of negotiation   73
levenshtein ratio
what are types of negotiation   what are advantages of negotiation   0.8571428571428571
what are types of negotiation   what are categories of negotiation   0.8571428571428571



expected output:
"what are the types of negotiation skill?"
"what are the categories in negotiation skill?"
output:similar
"what are the types of negotiation skill?"
"what are the advantages of negotiation skill?"
output:not similar

1 Answer


You want to score the semantic similarity of two strings.

Fuzzy-wuzzy and Levenshtein distance only score character-level distance.
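
To illustrate, here is a small sketch that uses `difflib.SequenceMatcher` from Python's standard library as a stand-in for a character-based scorer. That choice is an assumption made for illustration so the snippet runs without extra packages; it is not the fuzzywuzzy API itself. The helper name `char_ratio` is hypothetical.

```python
from difflib import SequenceMatcher


def char_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; it knows nothing about meaning."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


str1 = "what are types of negotiation"
str2 = "what are advantages of negotiation"
str3 = "what are categories of negotiation"

# Both pairs share most of their characters, so both score high,
# even though only str3 means the same thing as str1.
ratio_advantages = char_ratio(str1, str2)
ratio_categories = char_ratio(str1, str3)
print(round(ratio_advantages, 3), round(ratio_categories, 3))
```

Both pairs should come out with nearly the same score, which is why a character-based measure cannot separate "categories" from "advantages" here.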

You need to take semantic information into account, so you need a semantic representation of your strings.

A simple but effective approach could be:

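  1. Compute two vectors that represent your two strings, using pretrained word embeddings for your language (for example FastText's get_sentence_vector: https://fasttext.cc/docs/en/python-module.html#model-object).
  2. Compute the cosine similarity between the two vectors (close to 1: strings with nearly the same meaning; close to 0: really different strings).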
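
The two steps above can be sketched as follows. This is a minimal illustration, not a definitive implementation: `TOY_VECTORS` is a hand-made, hypothetical embedding table so that the snippet runs without downloading a model, and `sentence_vector` averages L2-normalized word vectors as a rough imitation of what a sentence-vector function does. With a real pretrained model you would replace `sentence_vector` with the model's own call (for FastText, `get_sentence_vector` on a model loaded with `fasttext.load_model`).

```python
import math

# Hypothetical toy embeddings (4 dimensions), invented for this example only.
# Real pretrained vectors have hundreds of dimensions.
TOY_VECTORS = {
    "what": [1.0, 0.0, 0.0, 0.0],
    "are": [0.0, 1.0, 0.0, 0.0],
    "of": [1.0, 1.0, 0.0, 0.0],
    "negotiation": [0.0, 1.0, 1.0, 0.0],
    "types": [0.0, 0.0, 1.0, 0.1],
    "categories": [0.0, 0.0, 0.9, 0.2],   # deliberately close to "types"
    "advantages": [0.0, 0.0, 0.1, 1.0],   # deliberately far from "types"
}


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def sentence_vector(sentence):
    """Step 1: average the normalized vectors of the known words."""
    words = [w for w in sentence.lower().split() if w in TOY_VECTORS]
    if not words:
        raise ValueError("no known words in sentence")
    dim = len(next(iter(TOY_VECTORS.values())))
    total = [0.0] * dim
    for word in words:
        for i, x in enumerate(normalize(TOY_VECTORS[word])):
            total[i] += x
    return [x / len(words) for x in total]


def cosine_similarity(u, v):
    """Step 2: cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


v1 = sentence_vector("what are types of negotiation")
v2 = sentence_vector("what are advantages of negotiation")
v3 = sentence_vector("what are categories of negotiation")

sim_advantages = cosine_similarity(v1, v2)
sim_categories = cosine_similarity(v1, v3)
print("types vs advantages:", round(sim_advantages, 3))
print("types vs categories:", round(sim_categories, 3))
```

Because the sentences share most of their words, both scores stay fairly high; what matters is that the "categories" pair scores higher than the "advantages" pair. Any cut-off for "similar" versus "not similar" is something you would have to tune on your own data.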

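Of course, there are better and more sophisticated approaches. For a deeper understanding of the topic, I recommend this article (https://medium.com/@adriensieg/text-similarities-da019229c894), which has thorough explanations and code implementations.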

Answered 2019-09-24T11:53:08.213