
I'm trying to get a list of every word, 2-word phrase, and 3-word phrase used in a set of product reviews (200K+ reviews). The reviews are provided to me as JSON objects. I've tried to keep as much data out of memory as possible by using generators, but I'm still running out of memory and don't know where to go next. I've reviewed the use of generators/iterators and a very similar question here: Repeating phrases in text Python, but I still can't get it to work on the large dataset (my code runs fine if I only take a subset of the reviews).

The format (or at least intended format) of my code is as follows:

- read the text file containing the JSON objects line by line
- parse the current line into a JSON object and pull out the review text (the dictionary also holds other data I don't need); a sample line is sketched below
- break the review into its component words, clean those words, and then add them to my master list, or increment that word/phrase's counter if it already exists
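Each line of sample_reviews.json holds one JSON object with the review under a 'text' key (that key name comes from the code below; the other fields here are just made-up placeholders), so a single line gets parsed like this:

import json

#hypothetical example of one line from sample_reviews.json; only the
#'text' field is used, the other keys are placeholders
line = '{"review_id": "abc123", "stars": 4, "text": "Great phone. The battery lasts all day."}'
review = json.loads(line)
print(review['text'])   #-> Great phone. The battery lasts all day.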

Any help would be greatly appreciated!

import json
import nltk
import collections

#define set of "stopwords", those that are removed
s_words=set(nltk.corpus.stopwords.words('english')).union(set(["it's", "us", " "]))

#load tokenizer, which will split text into words, and stemmer - which stems words
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
stemmer = nltk.SnowballStemmer('english')
master_wordlist = collections.defaultdict(int)
#open the raw data and read it in by line
allReviews = open('sample_reviews.json')
lines = allReviews.readlines()
allReviews.close()


#Get all of the words, 2 and 3 word phrases, in one review
def getAllWords(jsonObject):
    all_words = []
    phrase2 = []
    phrase3 = []

    sentences=tokenizer.tokenize(jsonObject['text'])
    for sentence in sentences:
        #split up the words and clean each word
        words = sentence.split()

        for word in words:
            adj_word = str(word).translate(None, '"""#$&*@.,!()-+?/[]1234567890\'').lower()
            #filter out stop words
            if adj_word not in s_words:

                all_words.append(str(stemmer.stem(adj_word)))

                #add all 2 word combos to list
                phrase2.append(adj_word)
                if len(phrase2) > 2:
                    phrase2.remove(phrase2[0])
                if len(phrase2) == 2:
                    all_words.append(tuple(phrase2))

                #add all 3 word combos to list
                phrase3.append(adj_word)
                if len(phrase3) > 3:
                    phrase3.remove(phrase3[0])
                if len(phrase3) == 3:
                    all_words.append(tuple(phrase3))

    return all_words
#end of getAllWords

#parse each line from the txt file to a json object
for c in lines:
    review = json.loads(c)
    #count instances of each unique word in wordlist
    for phrase in getAllWords(review):
        master_wordlist[phrase] += 1

1 Answer


I believe the call to readlines loads the entire file into memory; there should be less overhead if you just iterate over the file object line by line:

#parse each line from the txt file to a json object
with open('sample_reviews.json') as f:
    for line in f:
        review = json.loads(line)
        #count instances of each unique word in wordlist
        for phrase in getAllWords(review):
            master_wordlist[phrase] += 1
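If memory is still tight after that change, a further option along the same lines is to turn getAllWords into a generator and tally with collections.Counter, so no per-review list is ever built. A rough, untested sketch (keeping the question's 'text' field and cleaning step):

import json
import collections
import nltk

s_words = set(nltk.corpus.stopwords.words('english')).union(set(["it's", "us", " "]))
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
stemmer = nltk.SnowballStemmer('english')

#generator version of getAllWords: yields each cleaned word plus the
#2- and 3-word phrases ending at it, one item at a time
def gen_phrases(text):
    for sentence in tokenizer.tokenize(text):
        window = []
        for word in sentence.split():
            adj_word = str(word).translate(None, '"#$&*@.,!()-+?/[]1234567890\'').lower()
            if adj_word in s_words:
                continue
            yield str(stemmer.stem(adj_word))
            window.append(adj_word)
            window = window[-3:]          #keep only the last 3 cleaned words
            if len(window) >= 2:
                yield tuple(window[-2:])  #2-word phrase
            if len(window) == 3:
                yield tuple(window)       #3-word phrase

master_wordlist = collections.Counter()
with open('sample_reviews.json') as f:
    for line in f:
        master_wordlist.update(gen_phrases(json.loads(line)['text']))

Counter is a dict subclass, so master_wordlist can be used exactly as before, and most_common() is available if the end goal is the top words/phrases.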
answered 2013-05-16T18:02:35.823