python - gensim.corpora Dictionary raises TypeError, treating a tokenized column as a single string

Question

This is the code in question:

from gensim.corpora import Dictionary
tweets_dictionary = Dictionary(df.tokenized)

The pandas DataFrame df has two columns, "created_at" and "tokenized". "tokenized" consists of a series of words:

Running the code above produces the following error message:

TypeError: doc2bow expects an array of unicode tokens on input, not a single string

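This seems strange to me, because the tokenized column is not a single string. I tried converting the column to a single list, a list of lists, and a tuple, but nothing has worked so far. Thanks in advance for your help!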

score 0 · Accepted Answer

OK, I was being stupid: wrapping "df.tokenized" in a list works. I had just forgotten to save the code before running it.

So the correct code is:

from gensim.corpora import Dictionary
# Wrap the column in a list so the whole column is passed as one document
tweets_dictionary = Dictionary([df.tokenized])
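
For anyone hitting the same error: as far as I can tell, Dictionary treats each element of its input as one document, and doc2bow raises this TypeError whenever a document is a plain string instead of a sequence of tokens. With one word per row, iterating over the column yields strings, which is why the extra list is needed. Below is a minimal sketch of a shape check that uses only plain Python. The helper name find_string_documents and the sample words are made up for illustration, and the plain list only stands in for df.tokenized.

```python
def find_string_documents(corpus):
    """Return the positions of documents that are plain strings.

    Hypothetical helper (not part of gensim): it flags the input shape
    that triggers "doc2bow expects an array of unicode tokens on input,
    not a single string".
    """
    return [i for i, doc in enumerate(corpus) if isinstance(doc, str)]


# Made-up stand-in for df.tokenized: one word per row.
column = ["hello", "world", "hello"]

# Passing the column directly: every "document" is a single string.
bad_positions = find_string_documents(column)

# Wrapping the column in a list: one document made of three tokens.
ok_positions = find_string_documents([column])

print(bad_positions)
print(ok_positions)
```

If each row of the column already holds a list of tokens (one tweet per row), the check should come back empty for the column itself, and in that case the extra list would merge nothing useful: the column can be passed to Dictionary as it is.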
