python-3.6 - Making the file-writing process more efficient

Question

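I'm new to programming, and I'm running this script to clean a large text file (over 12,000 lines) and write the result to another .txt file. When I run it on a smaller file (around 500 lines) it finishes quickly, so I concluded that it is slow because of the file size. I would appreciate it if someone could guide me in making this code more efficient.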

input_file = open('bNEG.txt', 'rt', encoding='utf-8')
l_p = LanguageProcessing()
sentences=[]
for lines in input_file.readlines():
    tokeniz = l_p.tokeniz(lines)
    cleaned_url = l_p.clean_URL(tokeniz)
    remove_words = l_p.remove_non_englishwords(cleaned_url)
    stopwords_removed = l_p.remove_stopwords(remove_words)
    cleaned_sentence=' '.join(str(s) for s in stopwords_removed)+"\n"
    output_file = open('cNEG.txt', 'w', encoding='utf-8')
    sentences.append(cleaned_sentence)
    output_file.writelines(sentences)
input_file.close()
output_file.close()

Edit: Below is the corrected code as suggested in the answer, with a few other changes to suit my requirements.

input_file = open('chromehistory_log.txt', 'rt', encoding='utf-8')
output_file = open('dNEG.txt', 'w', encoding='utf-8')
l_p = LanguageProcessing()
#sentences=[]
for lines in input_file.readlines():
    #print(lines)
    tokeniz = l_p.tokeniz(lines)
    cleaned_url = l_p.clean_URL(tokeniz)
    remove_words = l_p.remove_non_englishwords(cleaned_url)
    stopwords_removed = l_p.remove_stopwords(remove_words)
    #print(stopwords_removed)
    if stopwords_removed==[]:
        # nothing left after cleaning, so skip this line
        continue
    else:
        cleaned_sentence=' '.join(str(s) for s in stopwords_removed)+"\n"

    #sentences.append(cleaned_sentence)
    # write() takes a single string; writelines() is meant for a list of strings
    output_file.write(cleaned_sentence)
input_file.close()
output_file.close()

score 0 · Accepted Answer

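Posting the discussion as an answer:

There are two problems here:

You open/create the output file and write the data inside the loop, once for every line of the input file. In addition, you are collecting all the data in a list (sentences). Because the file is reopened in 'w' mode each time, every iteration truncates it and rewrites everything collected so far.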
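To see why the slowdown grows so quickly with file size, a rough count helps. This is only a sketch: it assumes every input line produces one output line, it uses the question's approximate sizes (500 and 12,000 lines), and `total_lines_written` is a hypothetical helper, not part of the original script.

```python
def total_lines_written(n_lines):
    """Total lines written when iteration i rewrites all i lines collected so far."""
    # 1 + 2 + ... + n, which equals n * (n + 1) / 2
    return sum(range(1, n_lines + 1))


print(total_lines_written(500))    # 125250
print(total_lines_written(12000))  # 72006000
```

So a file 24 times longer leads to several hundred times more writing, on top of reopening the output file once per line.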

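You have two options:

a) Create the file before the loop, and write "cleaned_sentence" inside the loop (and drop the "sentences" list).

b) Collect everything in "sentences" and write "sentences" once, right after the loop.

The downside of a) is that it is slightly slower than b) (as long as the OS does not have to swap memory for b). The upside is that it uses far less memory and works no matter how large the file is or how much memory is installed in the machine.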
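The two options might look roughly like this. It is a minimal sketch, not the asker's actual pipeline: `clean_line` is a hypothetical stand-in for the `LanguageProcessing` steps in the question (here it just keeps alphabetic words), and the function names and paths are made up for illustration.

```python
def clean_line(line):
    """Hypothetical stand-in for the LanguageProcessing steps in the question."""
    return [word for word in line.split() if word.isalpha()]


def clean_streaming(src_path, dst_path):
    """Option a): open the output once, write each cleaned line as it is produced."""
    with open(src_path, 'rt', encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:  # the file object yields one line at a time
            words = clean_line(line)
            if words:  # skip lines that are empty after cleaning
                dst.write(' '.join(words) + '\n')


def clean_batched(src_path, dst_path):
    """Option b): collect every cleaned line, then write them all once after the loop."""
    sentences = []
    with open(src_path, 'rt', encoding='utf-8') as src:
        for line in src:
            words = clean_line(line)
            if words:
                sentences.append(' '.join(words) + '\n')
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.writelines(sentences)
```

Both functions produce the same output file. The `with` blocks close the files even if cleaning raises an exception, which the explicit `close()` calls in the question would not do.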
