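In gensim, when I pass a string as the input for training a doc2vec model, I get this error:
TypeError("don't know how to handle uri %s" % repr(uri))
I looked at the question "Doc2vec : TaggedLineDocument()" but I am still unsure about the input format.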
documents = TaggedLineDocument('myfile.txt')
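Should myfile.txt contain the tokens as a list of lists, as a separate list on each line for each document, or as a plain string per line?
For example, I have 2 documents.
Doc 1: Machine learning is a subfield of computer science that evolved from the study of pattern recognition.
Doc 2: Arthur Samuel defined machine learning as a "Field of study that gives computers the ability to learn".
So, what should myfile.txt look like?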
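Case 1: plain text, one document per line
Machine learning is a subfield of computer science that evolved from the study of pattern recognition
Arthur Samuel defined machine learning as a Field of study that gives computers the ability to learn
Case 2: a list of lists containing the tokens of each document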
[ ["Machine", "learning", "is", "a", "subfield", "of", "computer", "science", "that", "evolved", "from", "the", "study", "of", "pattern", "recognition"]
,
["Arthur", "Samuel", "defined", "machine", "learning", "as", "a", "Field", "of", "study", "that", "gives", "computers" ,"the", "ability", "to", "learn"] ]
Case 3: a list of tokens for each document, each on its own line
["Machine", "learning", "is", "a", "subfield", "of", "computer", "science", "that", "evolved", "from", "the", "study", "of", "pattern", "recognition"]
["Arthur", "Samuel", "defined", "machine", "learning", "as", "a", "Field", "of", "study", "that", "gives", "computers" ,"the", "ability", "to", "learn"]
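To make Case 1 concrete, here is a minimal sketch in plain Python (no gensim required) that writes one document per line and reads it back as whitespace-split tokens tagged with the line number. As far as I understand, this is roughly what TaggedLineDocument does when given a file path, but that is an assumption on my part. The helper names write_line_corpus and read_line_corpus are made up for illustration and are not part of gensim.

```python
import os
import tempfile


def write_line_corpus(path, docs):
    """Write one document per line (the Case 1 layout). Hypothetical helper."""
    with open(path, "w", encoding="utf-8") as fh:
        for doc in docs:
            # Collapse internal whitespace so each document stays on a single line.
            fh.write(" ".join(doc.split()) + "\n")


def read_line_corpus(path):
    """Yield (tokens, tag) pairs: whitespace-split words and the 0-based line number.

    Assumption: this mirrors how a line-per-document reader tokenizes its input.
    """
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh):
            yield line.split(), line_no


docs = [
    "Machine learning is a subfield of computer science that evolved "
    "from the study of pattern recognition",
    "Arthur Samuel defined machine learning as a Field of study that "
    "gives computers the ability to learn",
]

path = os.path.join(tempfile.mkdtemp(), "myfile.txt")
write_line_corpus(path, docs)
corpus = list(read_line_corpus(path))
print(corpus[0][0][:3], corpus[0][1])
```

If that assumption holds, the Case 2 and Case 3 layouts would be tokenized with the brackets, quotes, and commas left attached to the words, which is probably not what I want.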
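And when I run the model on test data, what format should the sentence have whose document vector I want to infer? Should it be like Case 1 or Case 2 below, or something else?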
model.infer_vector(testSentence, alpha=start_alpha, steps=infer_epoch)
Should testSentence be:
Case 1: a string
testSentence = "Machine learning is an evolving field"
Case 2: a list of tokens
testSentence = ["Machine", "learning", "is", "an", "evolving", "field"]
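For reference, getting from the Case 1 string to the Case 2 token list is just a whitespace split. A minimal sketch in plain Python follows; the variable names are mine, and whether infer_vector needs any further preprocessing (lowercasing, punctuation removal) is part of what I am asking.

```python
# Case 1: the raw sentence as a single string.
test_sentence = "Machine learning is an evolving field"

# Case 2: the same sentence as a list of whitespace-separated tokens.
test_tokens = test_sentence.split()
print(test_tokens)
```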