I want to generate a "bag of words" matrix containing the documents and the corresponding counts of the words in each document. To do that, I run the code below to initialize the bag-of-words matrix. Unfortunately, after x documents I get a memory error at the line where I read a document. Is there a better way to avoid the memory error? Note that I want to process a large number of documents, ~2,000,000, with only 8 GB of RAM.
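For scale, a rough back-of-envelope calculation suggests why a dense document-term matrix cannot fit at these numbers; the 50,000-word vocabulary and 8-byte counts below are hypothetical figures, not taken from the question:

# Back-of-envelope estimate (hypothetical vocabulary size and dtype):
# a dense count matrix of 2,000,000 docs x 50,000 words at 8 bytes
# per entry needs roughly 745 GiB, far beyond 8 GB of RAM.
n_docs, vocab_size, bytes_per_count = 2_000_000, 50_000, 8
print(round(n_docs * vocab_size * bytes_per_count / 2**30, 1), 'GiB')

Here is the code in question: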
import os
from time import time

def __init__(self, paths, words_count, normalize_matrix=False,
             trainingset_size=None, validation_set_words_list=None):
    '''
    Open all documents from the given paths.
    Initialize the variables needed in order
    to construct the word matrix.

    Parameters
    ----------
    paths: paths to the document folders, one folder per category.
    words_count: number of words in the bag of words.
    trainingset_size: the proportion of the data that should go to the training set.
    validation_set_words_list: the attributes used for validation.
    '''
    print('################ Data Processing Started ################')
    self.max_words_matrix = words_count
    # Containers for the processed corpus; without these initializations
    # the appends below raise AttributeError.
    self.class_names = []
    self.class_indices = []
    self.docs_list = []
    self.docs_names = []
    print('________________ Reading Docs From File System ________________')
    timer = time()
    for folder in paths:
        # The last path component is the category name.
        self.class_names.append(folder.split('/')[-1])
        print('____ data processing for category ' + folder)
        listing = os.listdir(folder)
        if trainingset_size is None:
            docs = listing
        elif validation_set_words_list is None:
            # Training split: the leading trainingset_size proportion.
            docs = listing[:int(len(listing) * trainingset_size - 1)]
        else:
            # Validation split: the remainder.
            docs = listing[int(len(listing) * trainingset_size + 1):]
        count = 1
        length = len(docs)
        for doc in docs:
            if doc.endswith('.txt'):
                # Use a context manager so file handles are closed promptly.
                with open(os.path.join(folder, doc)) as f:
                    d = f.read()
                # Append a filtered version of the document to the document list.
                self.docs_list.append(self.__filter__(d))
                # Append the name of the document to the list of document names.
                self.docs_names.append(doc)
                # Record the class index for this document.
                self.class_indices.append(len(self.class_names) - 1)
            print('Processed ' + str(count) + ' of ' + str(length) + ' in category ' + folder)
            count += 1
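In the code above, self.docs_list keeps a filtered copy of every document in memory at once, which is the likely source of the MemoryError. One direction that avoids this (a minimal sketch, not the code above): stream documents with a generator so only one is in memory at a time, and accumulate counts directly into a SciPy sparse matrix. The names iter_documents and build_sparse_counts are hypothetical, and the sketch assumes the vocabulary (a word -> column index dict) has been fixed in advance, e.g. by a first counting pass:

import os
from collections import Counter
from scipy.sparse import csr_matrix

def iter_documents(paths):
    """Yield (class_index, doc_name, text) one document at a time,
    so only a single document is ever held in memory."""
    for class_idx, folder in enumerate(paths):
        for doc in sorted(os.listdir(folder)):
            if doc.endswith('.txt'):
                with open(os.path.join(folder, doc)) as f:
                    yield class_idx, doc, f.read()

def build_sparse_counts(paths, vocabulary):
    """Build a CSR document-term matrix incrementally.
    `vocabulary` maps word -> column index (hypothetical: assumed
    to be built beforehand and restricted to words_count words)."""
    data, indices, indptr = [], [], [0]
    for _, _, text in iter_documents(paths):
        # Count only in-vocabulary tokens; real code would apply
        # the same filtering as self.__filter__ here.
        counts = Counter(w for w in text.split() if w in vocabulary)
        for word, n in counts.items():
            indices.append(vocabulary[word])
            data.append(n)
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr),
                      shape=(len(indptr) - 1, len(vocabulary)))

Since almost all entries of a bag-of-words matrix are zero, the sparse matrix stays small even for millions of documents. If scikit-learn is an option, sklearn.feature_extraction.text.HashingVectorizer also accepts any iterable of strings and produces a sparse matrix in one streaming pass with a fixed memory footprint, without needing a vocabulary pass at all.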