我建议使用正则表达式并将数据加载到某种类型的内存结构中(我假设这就是您所说的数据框的意思)。
我喜欢使用 Kodos 开发正则表达式: http: //kodos.sourceforge.net/
对于您在上面提供的日志片段,以下正则表达式将隔离一些重要部分:
^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"
Kodos 也创建了一些有用的代码片段:
rawstr = r"""^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)""""
embedded_rawstr = r"""^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)""""
matchstr = """128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 "POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" "dk.product.xml.client.transports.ServletBridge" "-""""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# method 2: using search function (w/ external flags)
match_obj = re.search(rawstr, matchstr)
# method 3: using search function (w/ embedded flags)
match_obj = re.search(embedded_rawstr, matchstr)
# Retrieve group(s) from match_obj
all_groups = match_obj.groups()
# Retrieve group(s) by index
group_1 = match_obj.group(1)
group_2 = match_obj.group(2)
group_3 = match_obj.group(3)
group_4 = match_obj.group(4)
group_5 = match_obj.group(5)
group_6 = match_obj.group(6)
group_7 = match_obj.group(7)
group_8 = match_obj.group(8)
group_9 = match_obj.group(9)
group_10 = match_obj.group(10)
# Retrieve group(s) by name
host = match_obj.group('host')
day = match_obj.group('day')
month = match_obj.group('month')
timestamp = match_obj.group('timestamp')
您可以很容易地在此基础上将日志加载到内存中并开始处理。