I am trying to train a word2vec model on a file of about 170K lines, with one sentence per line.
I think I may represent a special use case, because the "sentences" consist of arbitrary strings rather than dictionary words. Each sentence (line) has about 100 words, and each "word" has about 20 characters, including "/" characters and digits.
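Since the "words" are arbitrary strings, the number of distinct tokens could be much larger than in a natural-language corpus of the same size. A minimal sketch for measuring this, assuming the corpus is a directory of plain-text files with one sentence per line (the same layout as the fen_output directory below):

import os
from collections import Counter

def vocab_stats(dirname):
    # stream every line of every file and tally token frequencies
    counts = Counter()
    for fname in os.listdir(dirname):
        with open(os.path.join(dirname, fname)) as f:
            for line in f:
                counts.update(line.split())
    print("total tokens: %d, distinct tokens: %d"
          % (sum(counts.values()), len(counts)))
    return counts

# vocab_stats("../fen_output")  # same directory the training script reads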
The training code is very simple:
# as shown in http://rare-technologies.com/word2vec-tutorial/
import gensim, logging, os

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class MySentences(object):
    """Streams one sentence (line) at a time, so the corpus itself never sits fully in memory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in os.listdir(self.dirname):
            for line in open(os.path.join(self.dirname, fname)):
                yield line.split()

current_dir = os.path.dirname(os.path.realpath(__file__))

# each line represents a full chess match
input_dir = current_dir + "/../fen_output"
output_file = current_dir + "/../learned_vectors/output.model.bin"

sentences = MySentences(input_dir)
model = gensim.models.Word2Vec(sentences, workers=8)
The thing is, everything runs really fast up to about 100K sentences (with my RAM rising steadily), but then I run out of RAM, I can see that my PC has started swapping, and training grinds to a halt. I don't have much RAM available, only about 4GB, and word2vec uses all of it before the swapping begins.
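For scale, here is my rough back-of-the-envelope estimate (an assumption about gensim's internals on my part, not something from the docs): gensim keeps at least two float32 matrices of shape (vocab_size, vector_size), so with the default vector_size of 100 each vocabulary word costs on the order of 800 bytes, before any Python-level bookkeeping. If most of my ~17M tokens (170K lines x ~100 words) are near-unique strings, the vocabulary alone could exceed my 4GB:

vocab_size = 5000000      # hypothetical: millions of distinct 20-char strings
vector_size = 100         # gensim's default
bytes_per_float = 4       # float32
matrices = 2              # input vectors + output weights (hs or negative sampling)
total = vocab_size * vector_size * bytes_per_float * matrices
print("%.1f GB for the weight matrices alone" % (total / 2.0 ** 30))  # ~3.7 GB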
I believe I have OpenBLAS correctly linked to numpy; here is what numpy.show_config() tells me:
blas_info:
    libraries = ['blas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_info:
    libraries = ['lapack']
    library_dirs = ['/usr/lib']
    language = f77
atlas_threads_info:
    NOT AVAILABLE
blas_opt_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
openblas_info:
    libraries = ['openblas']
    library_dirs = ['/usr/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas']
    library_dirs = ['/usr/lib']
    language = f77
    define_macros = [('NO_ATLAS_INFO', 1)]
openblas_lapack_info:
    NOT AVAILABLE
lapack_mkl_info:
    NOT AVAILABLE
atlas_3_10_threads_info:
    NOT AVAILABLE
atlas_info:
    NOT AVAILABLE
atlas_3_10_info:
    NOT AVAILABLE
blas_mkl_info:
    NOT AVAILABLE
mkl_info:
    NOT AVAILABLE
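As a quick sanity check that the linked BLAS is actually being used, I can time a matrix multiply (just a timing heuristic, not definitive; the threshold below is my assumption):

import numpy, time

a = numpy.random.rand(2000, 2000).astype(numpy.float32)
start = time.time()
numpy.dot(a, a)                      # a 2000x2000 matrix multiply
print("dot took %.2fs" % (time.time() - start))
# with a working OpenBLAS this finishes in well under a second;
# several seconds would suggest numpy fell back to a reference BLAS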
My question is: is this expected on a machine without much available RAM, like mine, and should I get more RAM or train the model in smaller parts? Or does it look like my setup is configured incorrectly (or my code is inefficient)?
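(In case it clarifies what I mean by training on less: as far as I can tell from the docs, gensim's Word2Vec also accepts min_count and max_vocab_size parameters, which would cap the vocabulary rather than the corpus, e.g.:

model = gensim.models.Word2Vec(
    sentences,
    workers=8,
    min_count=5,              # gensim's default; raising it drops rare strings
    max_vocab_size=2000000)   # hypothetical cap: prunes rare words during the vocab scan

I am not sure whether that counts as "training in smaller parts" or just throwing data away.)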
Thank you in advance.