python - Optimizing a Python script that extracts and processes large data files

Question

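I am new to Python and naively wrote a Python script for the following task:

I want to create a bag-of-words representation of multiple objects. Each object is basically a pair (movie name, synopsis), and a bag-of-words representation of the synopsis is to be made. So each object is converted to (movie name, bag-of-words vector) in the final document.

Here is the script: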

import re
import math
import itertools
from nltk.corpus import stopwords
from nltk import PorterStemmer
from collections import defaultdict
from collections import Counter
from itertools import dropwhile

import sys, getopt

inp = "inp_6000.txt"  #input file name
out = "bowfilter10"   #output file name
with open(inp,'r') as plot_data:
    main_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    dictionary = defaultdict(Counter)
    doc_count = defaultdict(Counter)
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        words = re.findall(r'\w+', movie_plot, flags = re.UNICODE | re.LOCALE)  #split words
        elemStopW = filter(lambda x: x not in stopwords.words('english'), words)   #remove stop words, python nltk
        for word in elemStopW:
            word = PorterStemmer().stem_word(word)   #use python stemmer class to do stemming
            #increment the word count of the movie in the particular movie synopsis
            dictionary[movie_name][word] += 1
            #increment the count of a particular word in the main dictionary, which stores its frequency over all documents.
            main_dict[word] += 1
            #This is done to calculate term frequency-inverse document frequency. It notes the first occurrence of the word in the synopsis and ignores all others.
            if doc_count[word]['this_mov']==0:
                doc_count[word].update(count=1, this_mov=1);
        for word in doc_count:
            doc_count[word].update(this_mov=-1)
    #print "---------main_dict---------"
    #print main_dict
    #Remove all the words with frequency less than 5 in whole set of movies
    for key, count in dropwhile(lambda key_count: key_count[1] >= 5, main_dict.most_common()):
        del main_dict[key]
    #print main_dict
    #Write to file
    bow_vec = open(out, 'w');
    #calculate the bag-of-words vector and write it
    m = len(dictionary)
    for movie_name in dictionary.keys():
        #print movie_name
        vector = []
        for word in list(main_dict):
            #print word, dictionary[movie_name][word]
            x = dictionary[movie_name][word] * math.log(m/doc_count[word]['count'], 2)
            vector.append(x)
        #write to file
        bow_vec.write("%s" % movie_name)
        for item in vector:
            bow_vec.write("%s," % item)
        bow_vec.write("\n")

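Format of the data file and additional information about the data: the data file has the following format:

Movie name. Empty line. Movie synopsis (can be assumed to be around 150 words). Empty line.

Note: <*> is used only for notation.

Input file size:
The file is about 200 MB.

As of now, this script takes about 10-12 hours on a 3 GHz Intel processor.

Note: I am looking for improvements to the serial code. I know parallelization would improve it, but I want to look into that later. I want to take this opportunity to make the serial code more efficient.

Any help is appreciated.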

score 5 · Accepted Answer

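First of all, try getting rid of the regular expressions; they are heavy. My original suggestion was bad and would not have worked. Perhaps this will be more efficient: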

import string
# Python 2: lowercasing the table makes translate() fold case as well
trans_table = string.maketrans(string.punctuation,
                               ' '*len(string.punctuation)).lower()
words = movie_plot.translate(trans_table).split()
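
The snippet above relies on Python 2's `string.maketrans`. As a rough sketch of the same idea on Python 3 (the names `PUNCT_TO_SPACE` and `tokenize` are made up for illustration), `str.maketrans` builds the table and the lowercasing is done on the text itself. It is not an exact replacement for `\w+`: only ASCII punctuation is blanked out.

```python
import string

# Map every ASCII punctuation character to a space (Python 3 API).
PUNCT_TO_SPACE = str.maketrans(string.punctuation,
                               " " * len(string.punctuation))


def tokenize(text):
    """Lowercase the text, blank out punctuation, and split on whitespace."""
    return text.lower().translate(PUNCT_TO_SPACE).split()


print(tokenize("A long; and meaningless - sentence!"))
# ['a', 'long', 'and', 'meaningless', 'sentence']
```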

(Afterthought) I can't test it, but I think that if you store the result of this call in a variable

stops = stopwords.words('english')

or, probably better, convert it to a set first (if the function does not already return one)

stops = set(stopwords.words('english'))

you will also get some improvement.
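
To illustrate: in the question's script, `stopwords.words('english')` sits inside the `lambda`, so it is called again for every single word. Below is a minimal Python 3 sketch that builds the set once and reuses it. `STOP_WORDS` is a small hypothetical stand-in so the example runs without NLTK; in the real script it would be `set(stopwords.words('english'))`.

```python
# Hypothetical stand-in for set(stopwords.words('english')); the real
# NLTK list is much longer.
STOP_WORDS = frozenset({"a", "an", "and", "is", "the", "this"})


def remove_stop_words(words, stops=STOP_WORDS):
    """Keep only the words that are not in the prebuilt stop-word set."""
    return [w for w in words if w not in stops]


words = ["this", "is", "a", "long", "and", "meaningless", "sentence"]
print(remove_stop_words(words))
# ['long', 'meaningless', 'sentence']
```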

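(Answering your question from the comments) Every function call costs time; if each call returns a large block of data that you do not keep around, the wasted time can be huge. As for set versus list, compare the results: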

In [49]: my_list = range(100)

In [50]: %timeit 10 in my_list
1000000 loops, best of 3: 193 ns per loop

In [51]: %timeit 101 in my_list
1000000 loops, best of 3: 1.49 us per loop

In [52]: my_set = set(my_list)

In [53]: %timeit 101 in my_set
10000000 loops, best of 3: 45.2 ns per loop

In [54]: %timeit 10 in my_set
10000000 loops, best of 3: 47.2 ns per loop
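
The `%timeit` magic only exists inside IPython. A comparable measurement in a plain Python 3 script could be sketched with the standard `timeit` module; the helper name `time_membership` and the repetition count are arbitrary choices, and the absolute numbers will differ from machine to machine.

```python
import timeit

my_list = list(range(100))
my_set = set(my_list)


def time_membership(container, value, number=100_000):
    """Return the total seconds taken by `number` membership tests."""
    return timeit.timeit(lambda: value in container, number=number)


# 101 is absent, so the list has to be scanned to the end every time.
list_miss = time_membership(my_list, 101)
set_miss = time_membership(my_set, 101)
print("list miss: %.4f s, set miss: %.4f s" % (list_miss, set_miss))
```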

While we are down in the nitty-gritty details, here are measurements of split versus regular expressions:

In [30]: %timeit words = 'This is a long; and meaningless - sentence'.split(split_let)
1000000 loops, best of 3: 271 ns per loop

In [31]: %timeit words = re.findall(r'\w+', 'This is a long; and meaningless - sentence', flags = re.UNICODE | re.LOCALE)
100000 loops, best of 3: 3.08 us per loop

score 0 · Accepted Answer

Another thing that may hurt performance is deleting from the dictionary. Rebuilding the dictionary may be more efficient:

from itertools import takewhile

word_dict = {key: count for key, count in
             takewhile(lambda key_count: key_count[1] >= 5,
                       main_dict.most_common())}
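
A variation on the same rebuild, sketched for Python 3: filtering the items directly avoids the sort that `most_common()` performs. This is a swapped-in alternative, not the snippet above; `keep_frequent` and `min_count` are hypothetical names.

```python
from collections import Counter


def keep_frequent(counts, min_count=5):
    """Return a new Counter holding only entries with count >= min_count."""
    return Counter({word: n for word, n in counts.items() if n >= min_count})


main_dict = Counter({"film": 7, "hero": 5, "rare": 2})
frequent = keep_frequent(main_dict)
print(frequent)
```

The original counter is left untouched, so nothing is deleted while iterating.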

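In general, I am a bit too lazy to go into all the details, but using references a little more may be more efficient. As far as I can see, you do not need the *doc_count* variable: it is redundant and inefficient, and re-evaluating it also hurts your performance. *main_dict.keys()* does the same thing: it gives you the list of all words once.

Here is a sketch of what I have in mind. I cannot prove it is more efficient, but it certainly looks more pythonic: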

with open(inp,'r') as plot_data:
    word_dict = Counter()
    file1, file2 = itertools.tee(plot_data, 2)
    line_one = itertools.islice(file1, 0, None, 4)
    line_two = itertools.islice(file2, 2, None, 4)
    all_stop_words = stopwords.words('english')
    movie_dict = defaultdict(Counter)
    stemmer_func = PorterStemmer().stem_word 
    for movie_name, movie_plot in itertools.izip(line_one, line_two):
        movie_plot = movie_plot.lower()
        words = <see above - I am updating original post>
        all_words = [stemmer_func(word) for word in words 
                     if not word in all_stop_words]
        current_word_counter = Counter(all_words)
        movie_dict[movie_name].update(current_word_counter)
        word_dict.update(current_word_counter)
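
For readers on Python 3, the sketch above could be carried through to the tf-idf vectors roughly as follows. This is an illustration under assumptions, not the method from the question: stemming and stop-word removal are left out, tokenizing defaults to a plain `str.split`, and `read_records`, `tf_idf` and `min_count` are made-up names. It keeps a separate document-frequency counter, assuming the weight `math.log(m/doc_count[word]['count'], 2)` from the question is still wanted. Python 3's `/` is true division, whereas the Python 2 expression in the question truncates to an integer.

```python
import math
from collections import Counter
from itertools import islice


def read_records(lines):
    """Yield (name, synopsis) from records laid out as:
    name, blank line, synopsis, blank line."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, 4))
        if len(chunk) < 3:
            return
        yield chunk[0].strip(), chunk[2].strip()


def tf_idf(records, tokenize=str.split, min_count=5):
    """Return (vocabulary, {name: tf-idf vector}) for the given records."""
    term_counts = {}          # movie name -> Counter of its words
    total_counts = Counter()  # word -> occurrences over all synopses
    doc_freq = Counter()      # word -> number of synopses containing it
    for name, synopsis in records:
        counts = Counter(tokenize(synopsis.lower()))
        term_counts[name] = counts
        total_counts.update(counts)
        doc_freq.update(counts.keys())  # each word counted once per synopsis
    # Drop rare words once, instead of deleting them one by one.
    vocab = sorted(w for w, n in total_counts.items() if n >= min_count)
    n_docs = len(term_counts)
    vectors = {
        name: [counts[w] * math.log2(n_docs / doc_freq[w]) for w in vocab]
        for name, counts in term_counts.items()
    }
    return vocab, vectors


sample = [
    "Movie A\n", "\n", "hero saves city\n", "\n",
    "Movie B\n", "\n", "hero meets hero\n", "\n",
]
vocab, vectors = tf_idf(read_records(sample), min_count=1)
print(vocab)               # ['city', 'hero', 'meets', 'saves']
print(vectors["Movie A"])  # [1.0, 0.0, 0.0, 1.0]
```

With the real data, `read_records` would be given the open file object, so the 200 MB file is streamed rather than loaded at once.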

One last thing: dictionary is not a good variable name; it does not tell you what it contains.
