I have a large dataset containing a mix of words and short phrases, such as:
dataset = [
"car",
"red-car",
"lorry",
"broken lorry",
"truck owner",
"train",
...
]
I am trying to find a way to determine which dataset entry is most similar to a short sentence, such as:
input = "I love my car that is red" # should map to "red-car"
input = "I purchased a new lorry" # should map to "lorry"
input = "I hate my redcar" # should map to "red-car"
input = "I will use my truck" # should map to "truck owner"
input = "Look at that yellow lorri" # should map to "lorry"
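I have tried several approaches without success, including:
Vectorizing the dataset and the input using TfidfVectorizer, then calculating the cosine similarity of the vectorized input value against each vectorized value from the dataset.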
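The problem is that this only really works when the input contains the exact word(s) that appear in the dataset. For example, with input = "trai" the cosine value is 0, whereas I am trying to get it to map to the value "train" in the dataset.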
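To make the failure concrete, here is a minimal sketch of the word-level comparison. It is not my actual code: it uses plain word counts from the standard library as a simplified stand-in for the TfidfVectorizer weights, and the names tokenize, cosine and word_cosine_scores are made up for illustration. The tokenizer (words of two or more characters) is only assumed to roughly mirror the vectorizer's default behaviour. The point is that when the input shares no whole token with any entry, every dot product is zero, so the weighting scheme cannot help.

```python
import math
import re
from collections import Counter

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]


def tokenize(text):
    # Lowercase and keep runs of 2+ word characters, so "red-car"
    # becomes ["red", "car"] and single letters such as "I" are dropped.
    return re.findall(r"\w\w+", text.lower())


def cosine(a, b):
    # Cosine similarity between two sparse count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_cosine_scores(query, candidates):
    # Score the query against every candidate using whole-word counts.
    q = Counter(tokenize(query))
    return {c: cosine(q, Counter(tokenize(c))) for c in candidates}


# An exact word match gives a usable ranking.
exact = word_cosine_scores("I purchased a new lorry", dataset)
print(max(exact, key=exact.get))

# A partial word shares no token with any entry, so every score is 0.0.
partial = word_cosine_scores("trai", dataset)
print(partial)
```

With the exact word present, "lorry" ranks first; with "trai", all entries tie at 0.0, so there is nothing to rank on.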
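The most obvious solution would be to perform a simple spell check, but that may not be a valid option, because I still want to pick the most similar result even when the words are slightly different, i.e.: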
input = "broke" # should map to "broken lorry" given the above dataset
If someone could suggest other approaches I could try, that would be much appreciated.