我想distance
在我的程序中实现 word2vec 的一部分。不幸的是,它不在 C/C++ 或 Python 中,但首先我不理解非二进制表示。这就是我获取文件的方式
./word2vec -train text8-phrase -output vectorsphrase.txt -cbow 0 -size 300 -window 10 -negative 0 -hs 1 -sample 1e-3 -threads 12 -binary 0
当我检查法国的vectorsphrase.txt文件时,我得到的是:
france -0.062591 0.264201 0.236335 -0.072601 -0.094313 -0.202659 -0.373314 0.074684 -0.262307 0.139383 -0.053648 -0.154181 0.126962 0.432593 -0.039440 0.108096 0.083703 0.148991 0.062826 0.048151 0.005555 0.066885 0.004729 -0.013939 -0.043947 0.057280 -0.005259 -0.223302 0.065608 -0.013932 -0.199372 -0.054966 -0.026725 0.012510 0.076350 -0.027816 -0.187357 0.248191 -0.085087 0.172979 -0.116789 0.014136 0.131571 0.173892 0.316052 -0.045492 0.057584 0.028944 -0.193623 0.043965 -0.166696 0.111058 0.145268 -0.119645 0.091659 0.056593 0.417965 -0.002927 -0.081579 -0.021356 0.030447 0.052507 -0.109058 -0.011124 -0.136975 0.104396 0.069319 0.030266 -0.193283 -0.024614 -0.025636 -0.100761 0.032366 0.069175 0.200204 -0.042976 -0.045123 -0.090475 0.090071 -0.037075 0.182373 0.151529 0.080198 -0.024067 -0.196623 -0.204863 0.154429 -0.190242 -0.063265 -0.323000 -0.109863 0.102366 -0.085017 0.198042 -0.033342 0.119225 0.176891 0.214628 0.031771 0.168739 0.063246 -0.147353 -0.003526 0.138835 -0.172460 -0.133294 -0.369451 0.063572 0.076098 -0.116277 0.208374 0.015783 0.145032 0.090530 -0.090470 0.109325 0.119184 0.024838 0.101194 -0.184154 -0.161648 -0.039929 0.079321 0.029462 -0.016193 -0.005485 0.197576 -0.118860 0.019042 -0.137174 -0.047933 -0.008472 0.092360 0.165395 0.013253 -0.099013 -0.017355 -0.048332 -0.077228 0.034320 -0.067505 -0.050190 -0.320440 -0.040684 -0.106958 -0.169634 -0.014216 0.225693 0.345064 0.135220 -0.181518 -0.035400 -0.095907 -0.084446 0.025784 0.090736 -0.150824 -0.351817 0.174671 0.091944 -0.112423 -0.140281 0.059532 0.002152 0.127812 0.090834 -0.130366 -0.061899 -0.280557 0.076417 -0.065455 0.205525 0.081539 0.108110 0.013989 0.133481 -0.256035 -0.135460 0.127465 0.113008 0.176893 -0.018049 0.062527 0.093005 -0.078622 -0.109232 0.065856 0.138583 0.097186 -0.124459 0.011706 0.113376 0.024505 -0.147662 -0.118035 0.129616 0.114539 0.165050 -0.134871 -0.036298 -0.103552 -0.108726 0.025950 0.053895 -0.173731 0.201482 -0.198697 -0.339452 0.166154 -0.014059 0.022529 0.212491 -0.051978 0.057627 0.198319 0.092990 -0.171989 -0.060376 0.084172 -0.034411 -0.065443 0.054581 -0.024187 0.072550 0.113017 0.080476 -0.170499 0.148091 -0.010503 0.158095 0.111080 0.007598 0.042551 -0.161005 -0.078712 0.318305 -0.011473 0.065593 0.121385 0.087249 -0.011178 0.053639 -0.100713 0.168689 0.120121 -0.058025 -0.161788 -0.101135 -0.080533 0.120502 -0.099477 0.187640 -0.054496 0.180532 -0.097961 0.049633 -0.019596 0.145623 0.284261 0.039761 0.053866 0.089779 -0.000676 -0.081653 0.082819 0.263937 -0.141818 0.011605 -0.028248 -0.020573 0.091329 -0.080264 -0.358647 -0.134252 0.115414 -0.066107 0.150770 -0.018897 0.168325 0.111375 -0.091567 -0.152783 -0.034834 -0.418656 -0.091504 -0.134671 0.051754 -0.129495 0.230855 -0.339259 0.208410 0.191621 0.007837 -0.016602 -0.131502 -0.059481 -0.185196 0.303028 0.017646 -0.047340
因此,除了余弦值之外,我什么也没有得到,当我跑完距离并输入 france 时,我得到了
字余弦距离
spain 0.678515
belgium 0.665923
netherlands 0.652428
italy 0.633130
switzerland 0.622323
luxembourg 0.610033
portugal 0.577154
russia 0.571507
germany 0.563291
catalonia 0.534176
那么,根据给定的概率,我如何将它与其他单词联系起来,我如何知道哪个属于哪个?