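The other answers that use regular expressions and line splitting will get the job done, but if you want a fully maintainable solution that grows with you, you should build a grammar. I like pyparsing for this: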
S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''
from pyparsing import Group, Literal, OneOrMore, QuotedString, Word, alphas, nums, restOfLine
from collections import defaultdict
# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")
line = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))
# Now parsing is a piece of cake!
P = grammar.parseString(S)
counts = defaultdict(int)
# Tally check-ins (IN) and check-outs (OUT) per feature name
for x in P:
    if x.flag == "IN": counts[x.name] += 1
    if x.flag == "OUT": counts[x.name] -= 1
for key in counts:
    print(key, counts[key])
This gives the output:
lq_viz_server 1
OFM32 -1
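This would look more impressive if your sample log file were longer. The beauty of a pyparsing solution is that it adapts to more complex queries down the road (for example, grabbing and parsing the timestamp, pulling out the email address, parsing the error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text into a computer-friendly format, abstracting the parsing implementation away from its usage.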
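To make that concrete, here is a rough sketch of how the same grammar might be extended to answer a different question: who touched which feature, and when. The result names `hour`, `minute`, `second`, `account`, `user`, `host` and `detail`, the `LOG` variable, and the guess that a parenthesised token may sit between the feature name and the `user@host` field are all my own assumptions based on the three sample lines, not anything the log format guarantees.

```python
from pyparsing import (Group, Literal, OneOrMore, Optional, QuotedString,
                       Regex, Suppress, Word, alphanums, alphas, nums,
                       restOfLine)

LOG = '''
 7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
 7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS ) Albahraj@nabwmps3 (License server system does not support this feature. (-18,327))
 7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''

# Numbers are converted to int as they are parsed
num = Word(nums).setParseAction(lambda t: int(t[0]))
colon = Suppress(":")
timestamp = Group(num("hour") + colon + num("minute") + colon + num("second"))("timestamp")
label = Suppress(Literal("(slbfd)"))
flag = Word(alphas)("flag") + colon
name = QuotedString('"')("name")

# Assumption: an optional "( ... )" token may precede the user@host field
qualifier = Optional(Suppress(Regex(r"\([^)]*\)")))
ident = Word(alphanums + "_.-")
account = Group(ident("user") + Suppress("@") + ident("host"))("account")

line = timestamp + label + flag + name + qualifier + account + restOfLine("detail")
grammar = OneOrMore(Group(line))

# A new query against the same parsed structure: seconds since midnight,
# event type, feature name, user and host for every record
events = []
for rec in grammar.parseString(LOG):
    ts = rec.timestamp
    seconds = ts.hour * 3600 + ts.minute * 60 + ts.second
    events.append((seconds, rec.flag, rec.name, rec.account.user, rec.account.host))

for event in events:
    print(event)
```

The counting loop from before keeps working unchanged against this grammar, since `flag` and `name` are still there; only the grammar grew.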