python - How to run logic on several large apache log files in Python?

Question

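I have a bunch of apache log files that I need to parse and extract information from. My script works fine for a single file, but I'm wondering what the best approach is for handling multiple files.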

Should I:

- loop through all files and create a temporary file holding all contents
- run my logic on the "concatenated" file

Or

- loop through every file
- run my logic file by file
- try to merge the results of every file

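File-wise, I'm looking at about a year of logs, with roughly 2 million entries per day, reported for a large number of machines. My single-file script generates an object with "entries" for every machine, so I'm wondering: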

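Question:
Should I generate one joint temporary file, or run file by file, generate file-based objects, and then merge the entries for the same machine across all files?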

score 2 · Accepted Answer

You can use the glob and fileinput modules to loop over all of them efficiently and treat them as one "big file":

import fileinput
from glob import glob

log_files = glob('/some/dir/with/logs/*.log')
for line in fileinput.input(log_files):
    pass # do something

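To tie this back to the per-machine objects in the question, here is a minimal sketch of how the loop body might aggregate across all files in one pass. It assumes the machine identifier is the first whitespace-separated field of each line (as the client host is in Apache's Common Log Format); adjust the parsing if your format differs. The function name count_entries_by_machine and the use of a Counter are illustrative stand-ins for your own logic, not something from the question.

```python
import fileinput
from collections import Counter
from glob import glob


def count_entries_by_machine(log_files):
    """Stream every line of every file once and count entries per machine."""
    counts = Counter()
    # Guard: fileinput.input() falls back to reading stdin when given an
    # empty list, so return early if the glob matched nothing.
    if not log_files:
        return counts
    with fileinput.input(files=log_files) as stream:
        for line in stream:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            # Assumption: the first field identifies the machine.
            counts[fields[0]] += 1
    return counts


# Hypothetical usage; the path is the placeholder from the answer above.
# counts = count_entries_by_machine(sorted(glob('/some/dir/with/logs/*.log')))
```

Because fileinput reads lines lazily, no temporary concatenated file is needed and only the running per-machine totals stay in memory. If you need to know which file a line came from, fileinput.filename() returns the name of the file currently being read.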