I have a .log file with the following format:

t00aws.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:22 -0400] "PUT /v1/patients/0000341934-821?accessToken=54189273 HTTP/1.1" 204 0 0.151 0.151 0.139 - 0.000 - "Java/1.6.0_31"
t00awsp.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:22 -0400] "PUT /v1/encounters/0-2900172?accessToken=54189273 HTTP/1.1" 204 0 0.189 10.225.128.165 - 0.000 - "Java/1.6.0_31" 
t00awsp.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:31 -0400] "PUT /v1/encounters/84 -843-5085577?accessToken=54189273 HTTP/1.1" 204 0 0.151 10.225.128.165 - 0.000 - "Java/1.6.0_31"
t00aws.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:31 -0400] "PUT /v1/encounters/84 843-5085577?accessToken=54189273 HTTP/1.1" 204 0 0.147 0.146 0.135 - 0.000 - "Java/1.6.0_31" 
t00awsp2.hma.com 102.225.128.165 AnonymousUser - [30/Aug/2013:02:17:34 -0400] "PUT /v1/encounters/000 63-1332770?accessToken=54189273 HTTP/1.1" 204 0 0.152 0.152 0.140 - 0.000 - "Java/1.6.0_31" 

I have written a method to parse this log file, and I want to use a dictionary to find the IP addresses that called a URL n times, for example:

url_dict : {
'10.225.128.165' : ['v1/ready' , 4],     ####   'ip' : ['url' , count]
'10.225.128.162' : ['/v2/fab' , 2]
}

Here is my code in views.py:

def get_reports_hipaa(request): 
    wwwlog = lines_from_dir('*.log', '/home/arya/c/') 
    log_re = re.compile('^(?P<hostname>[\w.]*) (?P<clientip>[\d.]+) (?P<user>[\w-]+) (?P<application>[\w-]+) '+\
                        '(?P<request>\[\d+/\w+/\d+\:\d+\:\d+\:\d+[ \t]\-\d+\]) "(?P<method>GET|POST|PUT|DELETE|HEAD|TRACE|OPTIONS) (?P<url>.*?)'+\
                        ' (?P<protocol>HTTP/1.[01])" (?P<status>\d+) (?P<bytes_sent>\d+) (?P<request_time>[\d.-]+) (?P<upstream_response_time>[\d.-]+)'+\
                        ' (?P<hma_exec_time>[\d.-]+) (?P<mongo_exec_time>[\d.-]+) (?P<audit_response_time>[\d.-]+) (?P<queries_count>[\d.-]+) "(?P<user_agent>.*?)"$')
    url_list_4xx = []
    ip = {} 
    count = 0 
    unique_clientip = set()
    unique_url = set()
    url_dict = {}


    for line in wwwlog :
        print line
        m = log_re.match(line) 
        if m : 
            request1 = m.groupdict()  

            resource_name = get_resource_name(request1['url']) 
            time = request1["request"].split(" ")[0].split("[")[1] 
            time = datetime.strptime(str(time), '%d/%b/%Y:%H:%M:%S')  
            list = []
            clientip = request1["clientip"]
            if clientip  not in unique_clientip : 
                ip[clientip] = 0

            if clientip in unique_clientip :  
                url =  remove_access_token(request1['url'])
                if url in unique_url : 
                    list.append(url)
                    ip[clientip] += 1
                    list.append(ip[clientip])
                    url_dict[clientip]  = list 
                else:
                    unique_url.add(url)
            else :
                unique_clientip.add(request1["clientip"])

    return render(request, "hipaa_report.html", {"url_dict": url_dict})

My output is incorrect. Any suggestions for better logic?

1 Answer

Use a tuple as the key for url_dict:

key = (clientip, url)
url_dict[key] += 1

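from collections import defaultdict
url_dict = defaultdict(int)  # defaultdict takes a factory callable, not a value; int() returns 0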

Declaring url_dict as a defaultdict makes each counter start at 0 automatically, which turns the loop into:

for line in wwwlog:
    print line
    m = log_re.match(line)
    if m:
        request1 = m.groupdict()

        clientip = request1["clientip"]
        url = remove_access_token(request1['url'])

        # Count each (client IP, URL) pair; the defaultdict supplies the initial 0.
        key = (clientip, url)
        url_dict[key] += 1
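
For anyone who wants to try the tuple-key approach outside the Django view, here is a minimal, self-contained sketch in Python 3. Several things in it are assumptions rather than the asker's code: `LOG_RE` is a simplified pattern that only captures the client IP and the URL, `strip_query` is a hypothetical stand-in for `remove_access_token`, `called_n_times` is an illustrative helper, and the sample lines are abbreviated entries modeled on the log above (the third repeats the first request so that one pair reaches a count of 2).

```python
import re
from collections import defaultdict

# Simplified pattern (an assumption): it captures only the client IP and the
# URL, and ignores the timing fields at the end of each line.
LOG_RE = re.compile(
    r'^\S+ (?P<clientip>[\d.]+) .*?'
    r'"(?:GET|POST|PUT|DELETE|HEAD|TRACE|OPTIONS) (?P<url>.*?) HTTP/1\.[01]"'
)


def strip_query(url):
    """Hypothetical stand-in for remove_access_token: drop the query string."""
    return url.split("?", 1)[0]


def count_calls(lines):
    """Return a mapping of (clientip, url) -> number of matching log lines."""
    counts = defaultdict(int)  # missing keys start at int() == 0
    for line in lines:
        m = LOG_RE.match(line)
        if m:
            key = (m.group("clientip"), strip_query(m.group("url")))
            counts[key] += 1
    return counts


def called_n_times(counts, n):
    """Keep only the (clientip, url) pairs seen exactly n times."""
    return {key: c for key, c in counts.items() if c == n}


# Abbreviated sample lines modeled on the log in the question.
sample = [
    't00aws.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:22 -0400] '
    '"PUT /v1/patients/0000341934-821?accessToken=54189273 HTTP/1.1" 204 0',
    't00awsp.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:22 -0400] '
    '"PUT /v1/encounters/0-2900172?accessToken=54189273 HTTP/1.1" 204 0',
    't00aws.hma.com 101.225.128.165 AnonymousUser - [30/Aug/2013:02:17:31 -0400] '
    '"PUT /v1/patients/0000341934-821?accessToken=54189273 HTTP/1.1" 204 0',
    't00awsp2.hma.com 102.225.128.165 AnonymousUser - [30/Aug/2013:02:17:34 -0400] '
    '"PUT /v1/encounters/0-2900172?accessToken=54189273 HTTP/1.1" 204 0',
]

counts = count_calls(sample)
twice = called_n_times(counts, 2)
```

Because the key is the (IP, URL) pair, one IP can be tracked against any number of URLs, which the `'ip' : ['url', count]` layout in the question cannot express. If the template needs a per-IP grouping, the pairs can be regrouped into a nested dict after counting.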
Answered 2013-09-03T07:20:50.390