
I can't reproduce word2vec results with Gensim, and some of the results make no sense. Gensim is an open-source toolkit designed to process large text collections with efficient online algorithms, and it includes a Python implementation of Google's word2vec algorithm.

I'm following an online tutorial but can't reproduce its results. For `most_similar(positive=['woman', 'king'], negative=['man'])`, the most similar words should be 'wenceslaus' and 'queen'. Instead I get u'eleonore' and u'iv'. The word most similar to 'fast' is 'slow', and for 'quick' it is 'mitsumi'.

Any insight? Here are my code and results:

>>> from gensim.models import word2vec
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
>>> sentences = word2vec.Text8Corpus('/tmp/text8')
>>> model = word2vec.Word2Vec(sentences, size=200)
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=2)
Out[63]: [(u'eleonore', 0.5138808...), (u'iv', 0.510519325...)]
>>> model.most_similar(positive=['fast'])
Out[64]: [(u'slow', 0.48932...), (u'paced', 0.46925...)...]
>>> model.most_similar(positive=['quick'], topn=1)
Out[65]: [(u'mitsumi', 0.48545...)]


2 Answers


Your results actually do make sense.

word2vec is random in several ways (random vector initialization, multithreading, and so on), so it is not surprising that you don't get exactly the same results as the tutorial.
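This randomness can be seen without gensim at all. Below is a stdlib-only sketch (the vocabulary, dimension, and uniform init range are illustrative assumptions, loosely mimicking word2vec's random starting point): two runs that differ only in the seed start from different vectors, so their neighborhoods can differ before, and often after, training.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def init_vectors(vocab, dim, seed):
    """Random uniform init, like word2vec's per-run starting point."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.5, 0.5) for _ in range(dim)] for w in vocab}

vocab = ["king", "queen", "man", "woman", "slow"]

# Two "runs" that differ only in the seed.
run_a = init_vectors(vocab, dim=8, seed=0)
run_b = init_vectors(vocab, dim=8, seed=1)

nearest_a = max((w for w in vocab if w != "king"),
                key=lambda w: cosine(run_a["king"], run_a[w]))
nearest_b = max((w for w in vocab if w != "king"),
                key=lambda w: cosine(run_b["king"], run_b[w]))
print(nearest_a, nearest_b)  # the nearest neighbor of "king" can differ per run
```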

Also, 'eleonore' is a princess's name and 'iv' is a Roman numeral; both terms are related to the expected 'queen'. If you are skeptical of the results, try inspecting the text itself:

>>> import nltk
>>> with open('/tmp/text8', 'r') as f:
...     text = nltk.Text(f.read().split())
>>> text.concordance('eleonore')

Displaying 6 of 6 matches:
en the one eight year old princess eleonore of portugal whose dowry helped him
nglish historian one six five five eleonore gonzaga wife of ferdinand ii holy 
riage in one six zero three was to eleonore of hohenzollern born one five eigh
frederick duke of prussia and mary eleonore of kleve children of joachim frede
ive child of joachim frederick and eleonore of hohenzollern marie eleonore bor
and eleonore of hohenzollern marie eleonore born two two march one six zero se

However, if you are still unhappy with the results, you might want to do the following:

  1. Try multiple runs. Each will produce different vectors. (Not a clever approach, though.)
  2. Try a larger topn and look at more than one or two similar terms. 'eleonore' or 'iv' may simply be close competitors of 'queen'.

    >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=20)
    [('iii', 0.51035475730896), ('vii', 0.5096821188926697), ('frederick', 0.5058648586273193), ('son', 0.5021922588348389), ('wenceslaus', 0.500456690788269), ('eleonore', 0.49771684408187866), ('iv', 0.4948177933692932), ('henry', 0.49309787154197693), ('viii', 0.4924878478050232), ('sigismund', 0.49033164978027344), ('letsie', 0.4879177212715149), ('wladislaus', 0.4867924451828003), ('boleslaus', 0.47995251417160034), ('dagobert', 0.4767090082168579), ('corvinus', 0.476703941822052), ('abdicates', 0.47494029998779297), ('jadwiga', 0.4712049961090088), ('eldest', 0.4683353900909424), ('anjou', 0.46781229972839355), ('queen', 0.46647682785987854)]
    
  3. Try tuning min_count. This helps you drop infrequent and seemingly "noisy" words. (The default min_count is 5.)

    >>> model = word2vec.Word2Vec(sentences, size=200, min_count=30)
    >>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=20)
    [('queen', 0.5332179665565491), ('son', 0.5205873250961304), ('daughter', 0.49179190397262573), ('henry', 0.4898293614387512), ('antipope', 0.4872135818004608), ('eldest', 0.48199930787086487), ('viii', 0.47991085052490234), ('matilda', 0.4746955633163452), ('iii', 0.4663817882537842), ('duke', 0.46338942646980286), ('jadwiga', 0.4630076289176941), ('vii', 0.45885157585144043), ('aquitaine', 0.45757925510406494), ('vasa', 0.45703941583633423), ('pretender', 0.4559580683708191), ('reigned', 0.4528595805168152), ('marries', 0.4490123391151428), ('philip', 0.44660788774490356), ('anne', 0.4405106008052826), ('princess', 0.43850386142730713)]
    
Answered 2015-07-22T05:27:20.027

https://blog.esciencecenter.nl/king-man-woman-king-9a7fd2935a85 This article explains that word embeddings do not always predict king - man + woman = queen. Sometimes the arithmetic lands on 'king' itself.

https://medium.com/plotly/understanding-word-embedding-arithmetic-why-theres-no-single-answer-to-king-man-woman-cd2760e2cb7f This article explains the math behind embedding arithmetic and why there is no single answer to king - man + woman = queen.
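The point both articles make can be reproduced with a stdlib-only toy. The 2-d vectors below are hand-made assumptions, not trained embeddings: king - man + woman lands closest to 'king' itself, and 'queen' only wins once the input words are dropped, which is what gensim's most_similar does for you.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

# Hand-made 2-d toy vectors (hypothetical): x ~ "royalty", y ~ "gender".
vectors = {
    "king":  [1.00,  0.10],
    "queen": [0.95, -0.10],
    "man":   [0.30,  0.30],
    "woman": [0.30,  0.25],
}

# king - man + woman
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]

ranked = sorted(vectors, key=lambda w: cosine(target, vectors[w]), reverse=True)
print(ranked)  # ['king', 'queen', 'woman', 'man']: 'king' itself ranks first

# Dropping the input words, as gensim does, is how 'queen' surfaces.
best = next(w for w in ranked if w not in {"king", "man", "woman"})
print(best)    # 'queen'
```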

Answered 2021-09-07T14:41:10.807