
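I was reading about similarity measures and suddenly my whole world fell apart. I have implemented a search engine using clustering techniques. For clustering I used K-means with Euclidean distance as the distance metric, and I use cosine similarity to display the results. The results were amazingly accurate. But having read that article, I realize that what I actually did was normalize the document vectors and compute the Euclidean distance between two vectors, so I never took magnitude into account anywhere.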

Did I do something wrong?

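My own thinking is that a higher term frequency leads to a higher tf-idf value, and therefore a higher normalized tf-idf value, so such documents would still be ranked appropriately high. Thanks.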

Results (using unnormalized vectors; the numbers are Euclidean distances)

61.79689257425985 222Proposed Research Details.doc
144.15451315901478 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
72.61392308146608 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
72.96125277156261 done_Management strategies for impriing rabi (SKN Math).doc
65.51734241367222 done_RPFIII_dr.dogra.doc
66.72042766100921 Evaluation of crops and their varieties (SKN Math).doc
418.8868087170988 P. VIJAYA KUMAR (DSS).doc
140.3914521621597 RPF - I PIMS-ICAR project proposal for IASRI.doc
72.95414421468679 RPF-III__Indo-US_project.doc
82.25126123574397 220Introduction and objectives.doc

Results (using normalized vectors; the numbers are Euclidean distances)

1.3435369899385359 222Proposed Research Details.doc
1.1277471087250086 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
1.2741267093494966 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
1.264154265747389 done_Management strategies for impriing rabi (SKN Math).doc
1.2902191708899362 done_RPFIII_dr.dogra.doc
1.3128744973475515 Evaluation of crops and their varieties (SKN Math).doc
0.4924243033927417 P. VIJAYA KUMAR (DSS).doc
1.1747048933792805 RPF - I PIMS-ICAR project proposal for IASRI.doc
1.29150899172647 RPF-III__Indo-US_project.doc
1.318016051789028 220Introduction and objectives.doc

Results (the numbers are cosine similarities)

0.09745417833344654 222Proposed Research Details.doc
0.36409322938119104 and_Integrated_Assessment_of__Natural_resources_and_evolution_of_alternate_sustainable_land_management_options_for_tribal_dominated_watersheds_RRPS_24.doc
0.1883005642611103 done_Developing live fencing systems for soil & water conservation_NATIP-RNPS-3 SKN Math).doc
0.2009569961963377 done_Management strategies for impriing rabi (SKN Math).doc
0.16766724553404047 done_RPFIII_dr.dogra.doc
0.13818027710720598 Evaluation of crops and their varieties (SKN Math).doc
0.8787591527140649 P. VIJAYA KUMAR (DSS).doc
0.3100342067353838 RPF - I PIMS-ICAR project proposal for IASRI.doc
0.16600226214483405 RPF-III__Indo-US_project.doc
0.13141684361322944 220Introduction and objectives.doc

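Result sets 1 and 2 (unnormalized vs. normalized Euclidean distance) disagree with each other, while sets 2 and 3 agree closely: the higher the similarity, the smaller the distance. In every case the distance is taken between the cluster centroid vector and the document vector of each document.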

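In fact, the strangest result is the document with an unnormalized Euclidean distance of 418 that also has the highest similarity, 0.87. Once normalized, its distance becomes 0.49, which is consistent with the similarity.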


1 Answer


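As I recall from my information retrieval lectures, once both vectors are normalized to unit length, Euclidean distance and cosine similarity produce the same ranking in reverse order. For unit vectors, ||a - b||^2 = 2 - 2 cos(a, b), so the smallest distance always belongs to the highest similarity.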
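A minimal sketch of that relationship, in case it helps to see it numerically. The toy vectors and the helper names (`normalize`, `euclidean`, `cosine`, `unit_distance_from_cosine`) are made up for illustration and are not taken from your code; the two cosine values in the last check are copied from the numbers posted in the question.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale a vector to unit length."""
    n = norm(v)
    return [x / n for x in v]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def unit_distance_from_cosine(cos):
    """Distance between two unit vectors whose cosine similarity is `cos`."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * cos))

# Hypothetical data: a centroid and three documents of very different lengths.
centroid = [1.0, 2.0, 0.5]
docs = {
    "long_doc": [100.0, 210.0, 40.0],  # almost the same direction, huge magnitude
    "short_doc": [1.0, 0.0, 2.0],      # different direction, small magnitude
    "mid_doc": [3.0, 1.0, 4.0],
}

unit_centroid = normalize(centroid)
raw_dist = {k: euclidean(centroid, v) for k, v in docs.items()}
unit_dist = {k: euclidean(unit_centroid, normalize(v)) for k, v in docs.items()}
cos_sim = {k: cosine(centroid, v) for k, v in docs.items()}

# The identity holds for every document once the vectors are normalized.
for k in docs:
    assert math.isclose(unit_dist[k], unit_distance_from_cosine(cos_sim[k]),
                        rel_tol=1e-9, abs_tol=1e-12)

# Ascending normalized distance gives the same order as descending cosine.
by_distance = sorted(docs, key=lambda k: unit_dist[k])
by_cosine = sorted(docs, key=lambda k: cos_sim[k], reverse=True)

for k in by_cosine:
    print(f"{k}: raw={raw_dist[k]:.3f} unit={unit_dist[k]:.4f} cos={cos_sim[k]:.4f}")

# Two cosine values from the question, converted to normalized distances.
print(unit_distance_from_cosine(0.8787591527140649))
print(unit_distance_from_cosine(0.09745417833344654))
```

In this toy data `long_doc` is by far the farthest from the centroid in raw Euclidean distance, yet it has the highest cosine similarity, which mirrors the document with distance 418 and similarity 0.87 in the question. The unnormalized distance is dominated by vector length, whereas the normalized distance depends only on direction.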

Answered 2013-11-12T18:29:01.660