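I was reading up on Similarity Measures, and suddenly my whole world fell apart. I have implemented a search engine using clustering techniques. For clustering I used K-means with Euclidean distance as the distance metric, and I used cosine similarity to rank the displayed results. The results were amazingly accurate. But after reading this article, I realize that what I did was normalize the document vectors and then compute the Euclidean distance between two vectors, so I never took magnitude into account anywhere.
Have I done something wrong?
My thinking was that a higher term frequency leads to a higher tf-idf value, and therefore a higher normalized tf-idf value, so such documents would still rank appropriately higher. Thanks.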
Result 1 (un-normalized vectors; the numbers are Euclidean distances)
61.79689257425985 222Proposed Research Details.doc
144.15451315901478 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
72.61392308146608 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
72.96125277156261 done_Management strategies for impriing rabi (SKN Math).doc
65.51734241367222 done_RPFIII_dr.dogra.doc
66.72042766100921 Evaluation of crops and their varieties (SKN Math).doc
418.8868087170988 P. VIJAYA KUMAR (DSS).doc
140.3914521621597 RPF - I PIMS-ICAR project proposal for IASRI.doc
72.95414421468679 RPF-III__Indo-US_project.doc
82.25126123574397 220Introduction and objectives.doc
Result 2 (normalized vectors; the numbers are Euclidean distances)
1.3435369899385359 222Proposed Research Details.doc
1.1277471087250086 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
1.2741267093494966 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
1.264154265747389 done_Management strategies for impriing rabi (SKN Math).doc
1.2902191708899362 done_RPFIII_dr.dogra.doc
1.3128744973475515 Evaluation of crops and their varieties (SKN Math).doc
0.4924243033927417 P. VIJAYA KUMAR (DSS).doc
1.1747048933792805 RPF - I PIMS-ICAR project proposal for IASRI.doc
1.29150899172647 RPF-III__Indo-US_project.doc
1.318016051789028 220Introduction and objectives.doc
Result 3 (the numbers are cosine similarities)
0.09745417833344654 222Proposed Research Details.doc
0.36409322938119104 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
0.1883005642611103 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
0.2009569961963377 done_Management strategies for impriing rabi (SKN Math).doc
0.16766724553404047 done_RPFIII_dr.dogra.doc
0.13818027710720598 Evaluation of crops and their varieties (SKN Math).doc
0.8787591527140649 P. VIJAYA KUMAR (DSS).doc
0.3100342067353838 RPF - I PIMS-ICAR project proposal for IASRI.doc
0.16600226214483405 RPF-III__Indo-US_project.doc
0.13141684361322944 220Introduction and objectives.doc
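Results 1 and 2 do not agree with each other, while results 2 and 3 agree very well: the higher the similarity, the smaller the distance. Each distance is measured between the cluster centroid vector and the document vector of the listed document.
In fact, the strangest case is the document with an un-normalized Euclidean distance of 418 that also has the highest similarity, 0.87. Once normalized, its distance drops to 0.49, which is consistent with the similarity.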
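
The agreement between the normalized distances and the cosine similarities looks like what one would expect from the identity for unit-length vectors u and v: ||u - v||^2 = 2 * (1 - cos(u, v)). Below is a minimal sketch that checks this against three rows copied from the listings above. The helper name `distance_from_cosine` is hypothetical and is not part of my search engine code; it only assumes both vectors were scaled to length 1 before the distance was taken.

```python
import math

# (cosine similarity, Euclidean distance on normalized vectors),
# copied from the listings above for three of the documents.
rows = [
    ("222Proposed Research Details.doc", 0.09745417833344654, 1.3435369899385359),
    ("and_Integrated_Assessment_..._RRPS_24.doc", 0.36409322938119104, 1.1277471087250086),
    ("P. VIJAYA KUMAR (DSS).doc", 0.8787591527140649, 0.4924243033927417),
]


def distance_from_cosine(cos_sim):
    """Euclidean distance between two unit vectors with the given cosine.

    Hypothetical helper: uses ||u - v||^2 = 2 * (1 - cos), which holds
    only when both vectors have length 1.
    """
    return math.sqrt(2.0 * (1.0 - cos_sim))


for name, cos_sim, reported in rows:
    predicted = distance_from_cosine(cos_sim)
    print(f"{name}: predicted={predicted:.6f} reported={reported:.6f}")
```

If the identity holds for every row, ranking by normalized Euclidean distance and ranking by cosine similarity give the same order, and only the un-normalized distances in the first listing are affected by vector magnitude.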