Hello community members,
I am currently implementing the Word2Vec algorithm.
First, I extracted the data (sentences), broke the sentences up and split them into tokens (words), removed the punctuation, and stored the tokens in a single list. That list essentially contains the words. I then computed the frequency of each word and counted its occurrences based on that. The result is a list.
Next, I tried to build the model with gensim, but I ran into a problem: "the word is not in the vocabulary". The code snippet, including everything I have tried, is shown below.
import nltk, re, gensim
import string
from collections import Counter
from string import punctuation
from nltk.tokenize import word_tokenize
from gensim.models import Word2Vec
from nltk.corpus import gutenberg, stopwords
def preprocessing():
    raw_data = gutenberg.raw('shakespeare-hamlet.txt')
    tokens = word_tokenize(raw_data)
    tokens = [w.lower() for w in tokens]
    table = str.maketrans('', '', string.punctuation)
    stripped = [w.translate(table) for w in tokens]
    global words
    words = [word for word in stripped if word.isalpha()]
    sw = stopwords.words('english')
    sw1 = ['.', ',', '"', '?', '!', ':', ';', '(', ')', '[', ']', '{', '}']
    sw2 = ['for', 'on', 'ed', 'es', 'ing', 'of', 'd', 'is', 'has', 'have', 'been', 'had', 'was', 'are', 'were', 'a', 'an', 'the', 't', 's', 'than', 'that', 'it', '&', 'and', 'where', 'there', 'he', 'she', 'i', 'and', 'with', 'it', 'to', 'shall', 'why', 'ham']
    stop = sw + sw1 + sw2
    words = [w for w in words if w not in stop]
preprocessing()
def freq_count():
    fd = nltk.FreqDist(words)
    print(fd.most_common())
freq_count()
def word_embedding():
    for i in range(len(words)):
        model = Word2Vec(words, size=100, sg=1, window=3, min_count=1, iter=10, workers=4)
    model.init_sims(replace=True)
    model.save('word2vec_model')
    model = Word2Vec.load('word2vec_model')
    similarities = model.wv.most_similar('hamlet')
    for word, score in similarities:
        print(word, score)
word_embedding()
Note: I am using Python 3.7 on Windows. From the syntax of gensim, it is suggested that the input be sentences split into tokens, which are then used to build and train the model. My question is how to apply this to a corpus that is just a single list containing only words. During model training I also tried specifying the words as a nested list, i.e. [words].
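For what it's worth, here is a minimal sketch (plain Python, no gensim needed to run it; the short word list is a made-up example) of the data-shape issue I believe is behind the error:

```python
# Flat token list, as produced by a preprocessing step like the one above.
words = ["hamlet", "horatio", "denmark", "elsinore"]

# gensim's Word2Vec expects an iterable of sentences, where each sentence
# is itself a list of tokens. Passed a flat list of strings, it treats each
# *word* as a sentence, and iterating a string yields characters -- so the
# learned vocabulary ends up being single characters, which is why a lookup
# like most_similar('hamlet') raises "word ... not in the vocabulary":
char_level_view = [list(sentence) for sentence in words]
print(char_level_view[0])  # ['h', 'a', 'm', 'l', 'e', 't']

# Wrapping the flat list restores the expected shape: one "sentence"
# holding every token. Keeping real sentence boundaries (e.g. tokenizing
# with gutenberg.sents) would be better, but this is the minimal fix:
sentences = [words]
print(len(sentences), len(sentences[0]))  # 1 4

# The model call itself would then stay as in my snippet (gensim 3.x API):
#   model = Word2Vec(sentences, size=100, sg=1, window=3, min_count=1)
```

Is wrapping the list this way the right approach, or should I rebuild the corpus as a list of per-sentence token lists before training?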