
I'm trying to do the following in Python, along with some bash scripting, unless Python has an easier way of doing it.

I have a log file containing data that looks like this:

16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD
16:14:59.027010 - WARN - Ack Latency: 25ms - OrderId: 311yrsbl - On Venue: EFGH
16:14:59.027201 - WARN - Ack Latency: 22ms - OrderId: 311yrsbn - On Venue: IJKL
16:14:59.027235 - WARN - Cancel Latency: 137ms - OrderId: 311yrsbp - On Venue: MNOP
16:14:59.027256 - WARN - Cancel Latency: 220ms - OrderId: 311yrsbr - On Venue: QRST
16:14:59.027293 - WARN - Ack Latency: 142ms - OrderId: 311yrsbt - On Venue: UVWX
16:14:59.027329 - WARN - Cancel Latency: 134ms - OrderId: 311yrsbv - On Venue: YZ  
16:14:59.027359 - WARN - Ack Latency: 75ms - OrderId: 311yrsbx - On Venue: ABCD
16:14:59.027401 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbz - On Venue: ABCD
16:14:59.027426 - WARN - Cancel Latency: 212ms - OrderId: 311yrsc1 - On Venue: EFGH
16:14:59.027470 - WARN - Cancel Latency: 89ms - OrderId: 311yrsf7 - On Venue: IJKL  
16:14:59.027495 - WARN - Cancel Latency: 97ms - OrderId: 311yrsay - On Venue: IJKL

I need to extract the last entry from each line, then take each unique entry, search every line for it, and export the results to a .csv file.

I use the following bash script to get each unique entry:

cat LogFile_$(date +%Y%m%d).msg.log | awk '{print $14}' | sort | uniq

Based on the data above, the bash script returns the following:

ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZ
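
In Python, that unique-venue step could be sketched roughly like this (only a sketch; the file name is a placeholder for the real log file):

# Rough sketch: collect the unique venue names (the last field of each line).
# 'LogFile_20130402.msg.log' is just an illustrative file name.
venues = set()
with open('LogFile_20130402.msg.log') as logf:
    for line in logf:
        fields = line.split()
        if fields:                    # skip blank lines
            venues.add(fields[-1])    # the last entry, e.g. 'ABCD'

for venue in sorted(venues):
    print(venue)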

Now I want to search (grep) the same log file for each of those results and return the top ten matches. I have another bash script that does this, but how do I do it with a FOR loop? So, for x, where x = each entry above:

grep x LogFile_$(date +%Y%m%d).msg.log | awk '{print $7}' | sort -nr | uniq | head -10

Then write the results to a .csv file. The results should look like this (each field in its own column):

Column-A  Column-B  Column-C  Column-D  
ABCD      2sxrb6ab  Cancel    46ms
ABCD      2sxrb6af  Cancel    45ms  
ABCD      2sxrb6i2  Cancel    63ms  
ABCD      2sxrb6i3  Cancel    103ms  
EFGH      2sxrb6i4  Cancel    60ms  
EFGH      2sxrb6i7  Cancel    60ms  
IJKL      2sxrb6ie  Ack       74ms  
IJKL      2sxrb6if  Ack       74ms  
IJKL      2sxrb76s  Cancel    46ms  
MNOP      vcxrqrs5  Cancel    7651ms  
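
For what it's worth, a rough Python sketch of that whole loop and the CSV output might look like the block below. None of the names in it come from the question: the file names are placeholders, the field positions simply mirror the awk commands above ($7 and $14), and "top ten" is taken here as the first ten rows per venue in file order.

import csv
from collections import defaultdict

# Rough sketch: group (order id, type, latency) by venue, then write up to
# ten rows per venue to a CSV file. File names are placeholders.
by_venue = defaultdict(list)
with open('LogFile_20130402.msg.log') as logf:
    for line in logf:
        fields = line.split()
        if len(fields) < 14:
            continue                    # skip lines that do not fit the format
        latency = fields[6]             # e.g. '100ms'    (awk $7)
        order_id = fields[9]            # e.g. '311yrsbj'
        kind = fields[4]                # 'Cancel' or 'Ack'
        venue = fields[13]              # e.g. 'ABCD'     (awk $14)
        by_venue[venue].append((order_id, kind, latency))

with open('results.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    for venue in sorted(by_venue):
        for order_id, kind, latency in by_venue[venue][:10]:
            writer.writerow([venue, order_id, kind, latency])

If you want the ten largest latencies instead (as the sort -nr suggests), sort each venue's list by the numeric part of the latency before taking the first ten.
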

I'm a beginner with Python and haven't done much coding since college (13 years ago). Any help would be greatly appreciated. Thanks.


2 Answers


Suppose you already have the file open. What you want to do is record the timings for each entry; that is, each entry will result in one or more timings:

from collections import defaultdict

entries = defaultdict(list)
for line in your_file:
    # Parse the line and return the 'ABCD' part and time
    column_a, timing = parse(line)
    entries[column_a].append(timing)

Once that's done, you will have a dictionary like this:

{ 'ABCD': ['30ms', '25ms', '12ms'],
  'EFGH': ['12ms'],
  'IJKL': ['2ms', '14ms'] }

What you want to do now is convert this dictionary into another data structure (i.e. a list) sorted by the len of its values. Example:

In [15]: sorted(((k, v) for k, v in entries.items()), 
                key=lambda i: len(i[1]), reverse=True)
Out[15]: 
[('ABCD', ['30ms', '25ms', '12ms']),
 ('IJKL', ['2ms', '14ms']),
 ('EFGH', ['12ms'])]

Of course, this is only illustrative; you will probably want to collect more data in the original for loop.
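
The parse helper above is left undefined; a minimal sketch of one way to write it for the log format in the question (the name and the split-by-position approach are my assumptions, not part of the original answer):

def parse(line):
    # Assumes the whitespace-separated format shown in the question, e.g.
    # '16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: ... - On Venue: ABCD'
    fields = line.split()
    return fields[-1], fields[6]   # ('ABCD', '100ms')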

Answered 2013-04-02T08:35:06.800

Maybe not as concise as you imagined... but I think this solves your problem. I added some try/except handling to deal better with real data.

import re
import os
import csv
import collections

# get all log files under the current directory; of course this pattern could
# be more sophisticated, but that's not the focus here
log_pattern = re.compile(r"LogFile_[0-9]{8}\.msg\.log")
logfiles = [f for f in os.listdir('./') if log_pattern.match(f)]

# top n
nhead = 10
# used to parse the useful fields (both Ack and Cancel lines)
extract_pattern = re.compile(
    r'.*(Cancel|Ack) Latency: ([0-9]+ms) - OrderId: ([0-9a-z]+) - On Venue: ([A-Z]+)')
# container for final results
res = collections.defaultdict(list)

# parse out all interesting fields
for logfile in logfiles:
    with open(logfile, 'r') as logf:
        for line in logf:
            try:  # in case of blank line or line with no such fields.
                kind, latency, orderid, venue = extract_pattern.match(line).groups()
            except AttributeError:
                continue
            res[venue].append((orderid, kind, latency))

# write to csv
with open('res.csv', 'w', newline='') as resf:
    resc = csv.writer(resf)  # default comma delimiter, so each field gets its own column
    for venue in sorted(res):  # sort by Venue
        entries = res[venue]
        entries.sort()  # sort by OrderId
        for i in range(0, nhead):
            try:
                resc.writerow([venue, entries[i][0], entries[i][1], entries[i][2]])
            except IndexError:  # nhead can not be satisfied
                break

Answered 2013-04-02T09:26:20.690