python - load access log file into dataframe

Question

I need to process an access log file. Is it possible to load a log file such as an access log into a data frame and work on it there? Each line has a timestamp, a response time, and a request URL that I would like to analyze.

Example log line:

128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 "POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" "dk.product.xml.client.transports.ServletBridge" "-"

Update: I am extracting the response time and request URL using regular expressions, and I am now trying to build the dataset by adding them to a DataFrame.

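# Each value is wrapped in a list: pandas raises ValueError for an all-scalar dict without an index.
df2 = pd.DataFrame({'time': [pd.Timestamp(timestamp)],
                    'responsetime': [responsetime],
                    'requesturl': [requesturl]})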

score 0 · Accepted Answer

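I would suggest using regular expressions and loading the data into some type of in-memory structure (I am assuming that is what you mean by a data frame).

I like to use Kodos to develop regular expressions: http://kodos.sourceforge.net/

For the log snippet you provided above, the following regular expression will isolate some of the important parts: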

^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"
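For readability, the same idea can be written in verbose mode with every field named. This is only a sketch, not the pattern Kodos produced: the group names (ip, vhost, responsetime, request and so on) are my own, and I am assuming the number right after the timestamp is the response time, based on the sample line in the question.

```python
import re

# Verbose pattern for the sample line format from the question.
# Group names are illustrative; the number after the timestamp is
# assumed to be the response time.
LOG_PATTERN = re.compile(r'''
    ^(?P<ip>\S+)\s+                 # client address
    (?P<vhost>\S+)\s+               # virtual host
    \S+\s+\S+\s+                    # two placeholder fields
    \[(?P<timestamp>[^\]]+)\]\s+    # e.g. 04/Mar/2013:07:59:29 +0100
    (?P<responsetime>\d+)\s+        # assumed response time
    "(?P<request>[^"]*)"\s+         # request line
    (?P<status>\d{3})\s+            # HTTP status code
    (?P<size>\d+|-)                 # response size, or a dash when absent
''', re.VERBOSE)

sample = ('128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 '
          '"POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" '
          '"dk.product.xml.client.transports.ServletBridge" "-"')

match = LOG_PATTERN.match(sample)
# groupdict() returns every named group as a plain dict of strings.
fields = match.groupdict() if match else {}
print(fields.get('responsetime'), fields.get('request'))
```

Unlike the one-line pattern, this version accepts any characters inside the quoted request and a size of any length, so it should be somewhat less brittle on real logs.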

Kodos also generates some useful code snippets:

import re

# Triple single quotes are used here because the pattern and the sample both end with a double quote.
rawstr = r'''^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"'''
embedded_rawstr = r'''^(?P<host>[0-9.a-zA-Z ]+)\s-\s-\s\[(?P<day>[0-9]{2})/(?P<month>[a-zA-Z]{3})/(?P<timestamp>[0-9:]+ \+[0-9]{4})]\s+[0-9]+\s+"([a-zA-Z0-9 /.']+)"\s+([0-9]{3})\s+([0-9]{3})\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"\s+"([a-zA-Z0-9 /.-]+)"'''
matchstr = '''128.0.0.2 xml12.jantzens.dk - - [04/Mar/2013:07:59:29 +0100] 15625 "POST /servlet/XMLHandler HTTP/1.1" 200 516 "-" "dk.product.xml.client.transports.ServletBridge" "-"'''

# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)

# method 2: using search function (w/ external flags)
match_obj = re.search(rawstr, matchstr)

# method 3: using search function (w/ embedded flags)
match_obj = re.search(embedded_rawstr, matchstr)

# Retrieve group(s) from match_obj
all_groups = match_obj.groups()

# Retrieve group(s) by index
group_1 = match_obj.group(1)
group_2 = match_obj.group(2)
group_3 = match_obj.group(3)
group_4 = match_obj.group(4)
group_5 = match_obj.group(5)
group_6 = match_obj.group(6)
group_7 = match_obj.group(7)
group_8 = match_obj.group(8)
group_9 = match_obj.group(9)
group_10 = match_obj.group(10)

# Retrieve group(s) by name
host = match_obj.group('host')
day = match_obj.group('day')
month = match_obj.group('month')
timestamp = match_obj.group('timestamp')

You can easily build on this to load the log into memory and start processing it.
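As one way to build on this, the sketch below pulls out only the three fields the question asks about and collects them into a pandas DataFrame. Treat it as an illustration under assumptions rather than a finished loader: the function names parse_line and load_access_log and the column names are hypothetical, the number after the timestamp is assumed to be the response time, and the month abbreviation is assumed to be English.

```python
import re
from datetime import datetime

# Compact pattern for timestamp, response time and request URL only.
FIELDS = re.compile(
    r'\[(?P<time>[^\]]+)\]\s+'                       # [04/Mar/2013:07:59:29 +0100]
    r'(?P<responsetime>\d+)\s+'                      # assumed response time
    r'"(?P<method>\S+)\s+(?P<requesturl>\S+)[^"]*"'  # "POST /path HTTP/1.1"
)


def parse_line(line):
    """Return a dict for one log line, or None if the line does not match."""
    match = FIELDS.search(line)
    if match is None:
        return None
    return {
        # %b expects English month abbreviations under the default locale.
        'time': datetime.strptime(match.group('time'), '%d/%b/%Y:%H:%M:%S %z'),
        'responsetime': int(match.group('responsetime')),
        'requesturl': match.group('requesturl'),
    }


def load_access_log(path):
    """Parse every matching line of the file at path into a DataFrame."""
    import pandas as pd  # imported here so parse_line needs only the stdlib

    with open(path) as handle:
        rows = [row for row in map(parse_line, handle) if row is not None]
    # Building the frame once from a list of dicts avoids repeated appends.
    return pd.DataFrame(rows, columns=['time', 'responsetime', 'requesturl'])
```

With a file on disk (the name access.log is just a placeholder), df = load_access_log('access.log') gives one row per matching line, and something like df.groupby('requesturl')['responsetime'].mean() then summarizes response time per URL. Lines that do not match are skipped silently, which may or may not be what you want.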
