python - Python：运行脚本一次写入多个文件

Question

编程新手。我编写了一个 Python 代码（在 stackoverflow 的帮助下）读取目录中的 XML 文件，并使用 BeautifulSoup 复制元素的文本并将其写入输出目录。这是我的代码：

from bs4 import BeautifulSoup
import os, sys

source_folder='/Python27/source_xml'

for article in os.listdir(source_folder):
    soup=BeautifulSoup(open(source_folder+'/'+article))

    #I think this strips any tags that are nested in the sample_tag
    clean_text=soup.sample_tag.get_text("  ",strip=True)

    #This grabs an ID which I used as the output file name
    article_id=soup.article_id.get_text("  ",strip=True)

    with open(article_id,"wb") as f:
        f.write(clean_text.encode("UTF-8"))

问题是我需要它来运行 100,000 个 XML 文件。我昨晚启动了这个程序，据我估计它需要 15 个小时才能运行。例如，有没有一种方法可以一次读取/写入 100 个或 1000 个，而不是一次读取和写入一个？

score 3 · Accepted Answer

多处理模块是您的朋友。您可以使用这样的进程池在所有 cpu 上自动扩展代码：

from bs4 import BeautifulSoup
import os, sys
from multiprocessing import Pool


source_folder='/Python27/source_xml'

def extract_text(article):
    soup=BeautifulSoup(open(source_folder+'/'+article))

    #I think this strips any tags that are nested in the sample_tag
    clean_text=soup.sample_tag.get_text("  ",strip=True)

    #This grabs an ID which I used as the output file name
    article_id=soup.article_id.get_text("  ",strip=True)

    with open(article_id,"wb") as f:
            f.write(clean_text.encode("UTF-8"))

def main():
    pool = Pool()
    pool.map(extract_text, os.listdir(source_folder))

if __name__ == '__main__':
    main()

score 0 · Accepted Answer

尽管我很讨厌这个主意，但您可以使用线程。python 中的 GIL 在 I/O 操作上发布，因此您可以运行线程池并只使用作业。

python - Python：运行脚本一次写入多个文件

2 回答 2

Related

Reference