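I am training a Doc2Vec model with gensim. My corpus consists of many short sentences, and each sentence has a frequency giving the number of times it occurs in the whole corpus. As the code below shows, I simply repeat each sentence freq times. That works while the data is small, but as the data grows the frequencies can become very large, and the repeated corpus takes more memory than my machine has.

So:
1. Can I give the model the frequency of each record instead of repeating the record freq times?
2. Or is there some other way to save memory?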
import datetime

import gensim
from gensim.models.doc2vec import TaggedDocument


class AddressSentences(object):
    def __init__(self, path):
        self._path = path

    def __iter__(self):
        with open(self._path) as fi:
            # Strip the trailing newline so the last header matches "freq".
            headers = next(fi).strip().split(",")
            i_address, i_freq = headers.index("address"), headers.index("freq")
            index = 0
            for line in fi:
                cols = line.strip().split(",")
                freq = cols[i_freq]
                address = cols[i_address].split()
                # Here I do repeat
                for i in range(int(freq)):
                    yield TaggedDocument(address, [index])
                index += 1


print("START %s" % datetime.datetime.now())
train_corpus = list(AddressSentences("/data/corpus.csv"))
# size= and iter= are the gensim 3.x names (vector_size= and epochs= in gensim 4).
model = gensim.models.doc2vec.Doc2Vec(size=50, min_count=2, iter=55)
model.build_vocab(train_corpus)
model.train(train_corpus, total_examples=model.corpus_count, epochs=model.iter)
print("END %s" % datetime.datetime.now())
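
Much of the memory use described above probably comes from materialising every repeated document with list(...). Below is a minimal sketch of a restartable iterable that re-reads the source on each pass and never stores the repeats. It is an illustration under assumptions, not a tested fix: StreamingAddressSentences and open_source are hypothetical names, and the TaggedDocument here is a stand-in namedtuple so the sketch runs without gensim installed.

```python
import csv
import io
from collections import namedtuple

# Stand-in for gensim.models.doc2vec.TaggedDocument (same two fields),
# used only so this sketch runs without gensim.
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


class StreamingAddressSentences:
    """Restartable iterable: opens the source again on every pass."""

    def __init__(self, open_source):
        # open_source is a hypothetical zero-argument callable returning a
        # fresh text stream, e.g. lambda: open("/data/corpus.csv", newline="")
        self._open_source = open_source

    def __iter__(self):
        with self._open_source() as fi:
            reader = csv.DictReader(fi)
            for index, row in enumerate(reader):
                words = row["address"].split()
                for _ in range(int(row["freq"])):
                    # Documents are produced one at a time, never kept in a list.
                    yield TaggedDocument(words, [index])


# Tiny in-memory corpus with made-up frequencies, for demonstration only.
sample = "address,freq\nThe Business Centre,3\n61 Wellfield Road,2\n"
corpus = StreamingAddressSentences(lambda: io.StringIO(sample))

first_pass = [doc.tags[0] for doc in corpus]
second_pass = [doc.tags[0] for doc in corpus]
print(first_pass)   # [0, 0, 0, 1, 1]
print(second_pass)  # [0, 0, 0, 1, 1]
```

gensim's build_vocab and train accept a restartable iterable of TaggedDocument objects, so an object like this could be passed to them directly in place of the list. Training still processes every repeat, so this reduces memory use, not training time.
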
The corpus looks like this:
address,freq
Cecilia Chapman 711-2880 Nulla St.,1000
The Business Centre,1000
61 Wellfield Road,500
Celeste Slater 606-3727 Ullamcorper. Street,600
Theodore Lowe Azusa New York 39531,700
Kyla Olsen Ap #651-8679 Sodales Av.,300
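
One possible way to reduce the number of repeats while keeping the relative frequencies exact is to divide every frequency by their greatest common divisor. The sketch below is only an idea, not something gensim provides: scale_frequencies is a hypothetical helper, it assumes a non-empty list of positive integer frequencies, and it has not been checked against Doc2Vec training quality.

```python
from functools import reduce
from math import gcd


def scale_frequencies(freqs):
    """Divide every frequency by the greatest common divisor of all of them."""
    divisor = reduce(gcd, freqs)
    return [f // divisor for f in freqs]


# Frequencies taken from the sample corpus above.
freqs = [1000, 1000, 500, 600, 700, 300]
scaled = scale_frequencies(freqs)
print(scaled)                   # [10, 10, 5, 6, 7, 3]
print(sum(freqs), sum(scaled))  # 4100 41
```

For the sample this cuts the repeated corpus from 4100 documents to 41 with the same proportions. Each pass then sees fewer examples, so the number of epochs may need revisiting, and on real data the common divisor may well be 1, in which case nothing is saved.
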