python - Processing many huge log files with Python

Question

I am using some Python scripts to compute statistics. One kind of log, which I call the A log, has entries in this format:

[2012-09-12 12:23:33] SOME_UNIQ_ID filesize

Another log, which I call the B log, has the following format:

[2012-09-12 12:24:00] SOME_UNIQ_ID

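I need to count how many records in the A log also appear in the B log, and get the time gap between the two records that share the same ID. My implementation loads all the times and IDs of the B log into a map, then iterates over the A log to check whether each ID exists in the map. The problem is that it consumes too much memory, because I have nearly 100 million records in the B log. Any suggestions for improving performance and memory usage? Thanks.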

score 3 · Accepted Answer

If "A" fits in memory, you could try reversing the lookup: hold "A" in the map and scan "B" sequentially.
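A minimal sketch of the reversed lookup, assuming each ID occurs at most once per log and that the A log is the smaller one. The helper names `parse_line` and `match_logs` are made up for this illustration:

```python
from datetime import datetime

TIME_FORMAT = '%Y-%m-%d %H:%M:%S'

def parse_line(line):
    """Split '[timestamp] ID [rest]' into (datetime, ID)."""
    stamp, rest = line.split('] ', 1)
    return datetime.strptime(stamp.lstrip('['), TIME_FORMAT), rest.split()[0]

def match_logs(a_lines, b_lines):
    """Hold A in a dict, stream B, and yield (ID, time gap) for shared IDs."""
    a_times = {}
    for line in a_lines:
        ts, uniq_id = parse_line(line)
        a_times[uniq_id] = ts
    # B is never stored: each line is checked against the dict and discarded.
    for line in b_lines:
        ts, uniq_id = parse_line(line)
        if uniq_id in a_times:
            yield uniq_id, abs(ts - a_times[uniq_id])
```

Both arguments can be open file objects, so only the A log is ever resident in memory.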

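Otherwise, load the log files into a SQLite3 database with two tables (log_a, log_b) containing (timestamp, uniq_id, rest_of_line), then perform a SQL join on uniq_id and do whatever processing you need on the results. This keeps the memory overhead low and lets the SQL engine do the join, but of course it effectively duplicates the log files on disk (which is generally not an issue on most systems).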

Example

import sqlite3
from datetime import datetime

db = sqlite3.connect(':memory:')

db.execute('create table log_a (timestamp, uniq_id, filesize)')
a = ['[2012-09-12 12:23:33] SOME_UNIQ_ID filesize']
for line in a:
    timestamp, uniq_id, filesize = line.rsplit(' ', 2)
    db.execute('insert into log_a values(?, ?, ?)', (timestamp, uniq_id, filesize))
db.commit()

db.execute('create table log_b (timestamp, uniq_id)')
b = ['[2012-09-12 13:23:33] SOME_UNIQ_ID']
for line in b:
    timestamp, uniq_id = line.rsplit(' ', 1)
    db.execute('insert into log_b values(?, ?)', (timestamp, uniq_id))
db.commit()

TIME_FORMAT = '[%Y-%m-%d %H:%M:%S]'
for matches in db.execute('select * from log_a join log_b using (uniq_id)'):
    log_a_ts = datetime.strptime(matches[0], TIME_FORMAT)
    log_b_ts = datetime.strptime(matches[3], TIME_FORMAT)
    print(matches[1], 'has a difference of', abs(log_a_ts - log_b_ts))
    # 'SOME_UNIQ_ID has a difference of 1:00:00'
    # '1:00:00' == datetime.timedelta(0, 3600)

Notes:

the argument to sqlite3's .connect should be a filename
a and b should be your files
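For real files, the per-line inserts can be batched. The following is only a sketch under assumptions: `load_log` and `build_db` are hypothetical helpers, the index name `idx_b_id` is arbitrary, and creating the index after loading is a common practice rather than something this answer prescribes.

```python
import sqlite3

def load_log(db, table, lines, n_fields):
    """Bulk-insert log lines into `table`, splitting fields from the right."""
    placeholders = ', '.join('?' * n_fields)
    rows = (line.rstrip('\n').rsplit(' ', n_fields - 1) for line in lines)
    db.executemany('insert into %s values (%s)' % (table, placeholders), rows)
    db.commit()

def build_db(db_path, a_lines, b_lines):
    """Create both tables, load them, and index log_b on uniq_id."""
    db = sqlite3.connect(db_path)
    db.execute('create table log_a (timestamp, uniq_id, filesize)')
    db.execute('create table log_b (timestamp, uniq_id)')
    load_log(db, 'log_a', a_lines, 3)
    load_log(db, 'log_b', b_lines, 2)
    # Index after loading so the inserts do not pay for index upkeep.
    db.execute('create index idx_b_id on log_b (uniq_id)')
    return db
```

Passing open file objects as `a_lines` and `b_lines` streams the logs into the database without building a list in memory.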

score 1 · Accepted Answer

Try this:

Externally sort both files by SOME_UNIQ_ID
Read a record from the A log and save its SOME_UNIQ_ID (A)
Read a record from the B log and save its SOME_UNIQ_ID (B)
Compare SOME_UNIQ_ID (B) with SOME_UNIQ_ID (A)
- If it is less, read the next record from the B log
- If it is greater, read the next record from the A log and compare it with the saved SOME_UNIQ_ID (B)
- If it is equal, compute the time gap

Assuming the external sort works efficiently, the process ends up reading both files just once.
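The comparison loop above can be sketched as a merge join. This assumes both inputs are already sorted by ID in plain byte order (so Python's string comparison agrees with the sort) and that IDs are unique within each log. `parse` and `merge_join` are illustrative names:

```python
def parse(line):
    """Return (ID, timestamp string) from '[timestamp] ID [rest]'."""
    stamp, rest = line.split('] ', 1)
    return rest.split()[0], stamp.lstrip('[')

def merge_join(a_sorted, b_sorted):
    """Yield (ID, A timestamp, B timestamp) for IDs present in both inputs.

    Both inputs must be iterables of lines already sorted by ID.
    """
    a_iter, b_iter = iter(a_sorted), iter(b_sorted)
    a_line, b_line = next(a_iter, None), next(b_iter, None)
    while a_line is not None and b_line is not None:
        a_id, a_ts = parse(a_line)
        b_id, b_ts = parse(b_line)
        if b_id < a_id:
            b_line = next(b_iter, None)   # B is behind: advance B
        elif b_id > a_id:
            a_line = next(a_iter, None)   # A is behind: advance A
        else:
            yield a_id, a_ts, b_ts        # match: emit and advance both
            a_line, b_line = next(a_iter, None), next(b_iter, None)
```

Only one line from each file is held at a time, so memory use stays constant regardless of log size.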

score 0 · Accepted Answer

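Since the bottleneck was the timestamp conversion, I split this operation across the many separate machines that generate the A and B logs. These machines convert the string timestamps to epoch time, and the CENTER machine that consumes all of these logs now computes my result in almost 1/20 of the original time. I am posting my solution here; thanks to all of you.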
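The post does not show how the conversion was done. As a rough sketch of what each log-producing machine might run, assuming the timestamps are fixed-width and in UTC (neither is stated in the post), slicing the fields directly avoids a full format parse. `to_epoch` is a hypothetical name:

```python
import calendar

def to_epoch(stamp):
    """Convert '[YYYY-MM-DD HH:MM:SS]' to epoch seconds, treating it as UTC."""
    s = stamp.strip('[]')
    # Fixed-width slices: year, month, day, hour, minute, second.
    fields = (int(s[0:4]), int(s[5:7]), int(s[8:10]),
              int(s[11:13]), int(s[14:16]), int(s[17:19]))
    return calendar.timegm(fields)
```

With epoch integers in the logs, the central machine computes a time gap as a plain subtraction.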

score 0 · Accepted Answer

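I suggest using a database that supports both the datetime type and the uniqueidentifier type for the displayed form of the unique ID. The latter comes from Windows; if you are doing the task on Windows, you can use Microsoft SQL 2008 R2 Express edition (free). The two tables will not use any kind of key.

You can use the bcp utility of MS SQL (or the BULK INSERT command), which is probably one of the fastest ways to insert data from a text file.

Create the index on the uniqueidentifier column only after all records have been inserted; otherwise the presence of the index makes the insert operations slower. The inner join should then be about as fast as technically possible.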

score 0 · Accepted Answer

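First, what is the format of the ID? Is it globally unique?

I would choose one of these three options.

Use a database
Union of the two sets of IDs
Unix tools

~~I assume you prefer the second option. Load only the IDs from A and B. Assuming the IDs fit in 32-bit integers, memory usage will be less than 1GB. Then load the datetimes of the matching IDs and compute the gaps.~~ The first option would best fit the requirements.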

score 0 · Accepted Answer

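If the unique IDs can be ordered (for example alphabetically or numerically), you can do the comparison in batches.

Suppose the example IDs are numeric, in the range 1 - 10^7. You can then first put the first 10^6 elements in a hash table and do a sequential scan of the second file to find matching records.

In pseudo-Python (I have not tested this):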

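BATCH = 10**6
for i in range(10):  # 10 batches cover IDs 1 to 10^7
    hash_table = {}  # start each batch empty to cap memory
    file1.seek(0)    # file1, file2: already-open log files; rewind for each pass
    for line in file1:
        time, rest = line.split(']')
        id = int(rest.split()[0])  # second split isolates the ID
        if i * BATCH < id <= (i + 1) * BATCH:
            hash_table[id] = time

    file2.seek(0)
    for line in file2:
        time, rest = line.split(']')
        id = int(rest.split()[0])
        if id in hash_table:
            pass  # compare timestamps here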

If the IDs are not numeric, you can create the batches using a letter as the key:

if id.startswith(a_letter_from_the_alphabet):
    hash_table[id] = time
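A fuller sketch of the letter-keyed variant, under the assumption that the chosen prefixes together cover every possible first character of an ID and do not overlap. `batched_matches` and its callable arguments are hypothetical, not part of the answer:

```python
def parse(line):
    """Return (ID, timestamp string) from '[timestamp] ID [rest]'."""
    stamp, rest = line.split('] ', 1)
    return rest.split()[0], stamp.lstrip('[')

def batched_matches(read_a, read_b, prefixes):
    """Yield (ID, A timestamp, B timestamp), one prefix batch at a time.

    read_a and read_b are zero-argument callables returning a fresh
    iterable of lines, so every batch can rescan both logs.
    """
    for prefix in prefixes:
        hash_table = {}  # holds only the IDs of the current batch
        for line in read_a():
            uniq_id, stamp = parse(line)
            if uniq_id.startswith(prefix):
                hash_table[uniq_id] = stamp
        for line in read_b():
            uniq_id, stamp = parse(line)
            if uniq_id in hash_table:
                yield uniq_id, hash_table[uniq_id], stamp
```

For files on disk, the callables could be something like `lambda: open(path_a)`. More prefixes mean less memory per batch but more passes over both files.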
