gensim - Words are missing from the doc2vec model when documents are added to the model iteratively

Question

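I wrote the following code to build a Doc2vec model iteratively. As I read on this page, if a document contains more than 10000 tokens, we need to split the tokens and repeat the label for each segment.

Most of my documents are longer than 10000 tokens. I tried to split my tokens with the code below, but I get an error showing that tokens after the first 10000 are not taken into account in my model.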

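    import math
    import os

    from gensim.models.doc2vec import Doc2Vec, LabeledSentence

    def iter_documents(top_directory):
        mapDocName_Id=[]
        label=1
        for root, dirs, files in os.walk(top_directory):
            for fname in files:
                print fname
                inputs=[]
                tokens=[]
                with open(os.path.join(root, fname)) as f:
                    for i, line in enumerate(f):
                        # A line starting with this prefix closes the document accumulated so far
                        if line.startswith('clueweb09-en00'):
                            if tokens:
                                i=0
                                if len(tokens)<10000:
                                    yield LabeledSentence(tokens[:],[label])
                                else:
                                    # Split a long document into segments that share one label
                                    tLen=len (tokens)
                                    times= int(math.floor(tLen/10000))
                                    for i in range(0,times):
                                        s=i*10000
                                        e=(i*10000)+9999
                                        # Slice end is exclusive: this segment holds tokens s..s+9998
                                        yield LabeledSentence(tokens[s:e],[label])
                                    start=times*10000
                                    yield LabeledSentence(tokens[start:tLen],[label])
                                label+=1
                                tokens=[]
                        else:
                            tokens=tokens+line.split()
                    # The last document of each file is yielded whole, without splitting
                    yield LabeledSentence(tokens[:],[label])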

    class docIterator(object):
        def __init__(self,top_directory):
            self.top_directory = top_directory

        def __iter__(self):
            return iter_documents(self.top_directory)

    allDocs = docIterator(inputPath)

    model = Doc2Vec(allDocs, size = 300, window = 5, min_count = 2, workers = 4)
    model.save('my_model.doc2vec')

I test my model with the following code and then get this error:

    model= Doc2Vec.load('my_model.doc2vec')

    #print model['school']
    print model['philadelphia']

For 'school' I get a vector, but for 'philadelphia' I get this error. 'philadelphia' is among the tokens after index 10000.

    2017-02-27 13:59:36,751 : INFO : loading Doc2Vec object from /home/fl/Desktop/newInput/tokens/my_model.doc2vec
    2017-02-27 13:59:36,765 : INFO : loading docvecs recursively from /home/fl/Desktop/newInput/tokens/my_model.doc2vec.docvecs.* with mmap=None
    2017-02-27 13:59:36,765 : INFO : setting ignored attribute syn0norm to None
    2017-02-27 13:59:36,765 : INFO : setting ignored attribute cum_table to None
    Traceback (most recent call last):
      File "/home/fl/git/doc2vec_annoy/Doc2Vec_Annoy/KNN/CreateAnnoyIndex.py", line 31, in <module>
        print model['philadelphia']
      File "/home/flashkar/anaconda/lib/python2.7/site-packages/gensim/models/word2vec.py", line 1504, in __getitem__
        return self.syn0[self.vocab[words].index]
    KeyError: 'philadelphia'
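
One thing that can be checked without gensim is whether the index arithmetic in the splitting loop covers every token position. The sketch below replays the same bounds (`s = i*10000`, `e = i*10000 + 9999`, then the remainder) on plain integers. The helper names `question_slices` and `skipped_positions` are hypothetical and exist only for this illustration. It covers only the splitting branch inside the loop, not the final `yield` at the end of each file, and it does not by itself explain the `KeyError`.

```python
import math


def question_slices(n_tokens, size=10000):
    """Return the (start, end) slice bounds that the loop in the question produces."""
    if n_tokens < size:
        return [(0, n_tokens)]
    times = int(math.floor(n_tokens / size))
    # Same arithmetic as in the question: e = s + 9999 when size is 10000
    bounds = [(i * size, i * size + size - 1) for i in range(times)]
    bounds.append((times * size, n_tokens))
    return bounds


def skipped_positions(n_tokens, size=10000):
    """List the token positions that no slice covers (slice ends are exclusive)."""
    covered = set()
    for start, end in question_slices(n_tokens, size):
        covered.update(range(start, end))
    return sorted(set(range(n_tokens)) - covered)


print(skipped_positions(25000))  # [9999, 19999]
```

For a 25000-token document, one token per full segment falls outside every slice, because a Python slice `tokens[s:e]` stops before index `e`.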

score 0 · Accepted Answer

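I solved my problem by dividing each document into chunks of 10,000 tokens that all carry the same document identifier. As a result, I no longer check the token length.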
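
A minimal sketch of that approach, assuming a fixed chunk size of 10,000 and one integer label per document. `LabeledChunk` is a hypothetical stand-in for gensim's `LabeledSentence` (a named tuple with `words` and `tags` fields) so that the example runs without gensim installed, and `split_document` is an illustrative name, not the code from the answer.

```python
from collections import namedtuple

# Hypothetical stand-in for gensim's LabeledSentence: same two fields, words and tags
LabeledChunk = namedtuple("LabeledChunk", ["words", "tags"])

CHUNK_SIZE = 10000  # assumed chunk length, taken from the answer's description


def split_document(tokens, label, chunk_size=CHUNK_SIZE):
    """Yield consecutive chunks of at most chunk_size tokens, all tagged with label."""
    for start in range(0, len(tokens), chunk_size):
        # start + chunk_size is an exclusive end, so no token is skipped
        yield LabeledChunk(tokens[start:start + chunk_size], [label])


# Usage: a 25000-token document becomes three chunks sharing label 1
tokens = ["tok%d" % i for i in range(25000)]
chunks = list(split_document(tokens, label=1))
print([len(c.words) for c in chunks])  # [10000, 10000, 5000]
```

Because the loop steps through the token list in strides of `chunk_size`, short documents come out as a single chunk and no separate length check is needed. In the question's generator, each place that yields a document could call such a helper instead, including the final `yield` after the file has been read.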
