python - Calculating durations and their average from CSV file data with Python

Question

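I have a CSV file, and I want to write a script in which the user enters a source IP and a destination IP. The script should find every row in the CSV file that matches both addresses, calculate the time difference between consecutive matching sessions, and finally compute the average of those durations. Below is a sample of the data in column A of my CSV. The file has several columns, such as time, source IP and destination IP, but column A already contains all three pieces of information, so we can use it instead of three separate columns.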

_raw

2013-07-18 04:54:15.871 UDP 172.12.332.11:20547 172.12.332.11:20547 -> 172.56.213.80:53 CREATE Ignore 0

2013-07-18 04:54:15.841 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0

2013-07-18 04:54:15.831 TCP 172.12.332.11:42547 172.12.332.11:42547
-> 172.56.213.80:53 CREATE Ignore 0

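Below is my Python code, which no longer works. All that happens now is that it skips the IPs and does nothing. Please help me fix it, because I don't know why it isn't working.

My code in Python: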

import sys
from sys import argv
from datetime import datetime, timedelta

script, source, destination, filename = argv #assign the script arguments to variables
line_num = 0 #for keeping track of the current line number
count = 0 #for counting occurrences of source/destination IPs
occurrences = [] #array to store all of the matching occurrences of source/destination IPs
line_array = [] #array to store line numbers
avg = 0 #average
total = 0 #sum of microseconds

#function for converting timedelta to microseconds
def timedelta_to_microtime(td):
    return td.microseconds + (td.seconds + td.days * 86400) * 1000000

#use 'try' to catch IOexception
try:
    for line in open(filename):
        #if the first character is a number, read line
        if line[0].isdigit():
            if source and destination in line:
                #increment counter for each occurrence of matching IP combination
                count += 1
                #get the first 23 characters from the line (the date/time)
                #and convert it to a datetime object using the "%Y-%m-%d %H:%M:%S.%f"
                #format, then add it to the array named "occurrences."
                occurrences.append(datetime.strptime(line[:23], '%Y-%m-%d %H:%M:%S.%f'))
                line_array.append(line_num)
        #if the first character is not a number, it's the headers, skip them
        else:
            line_num += 2
            continue #go to next line
        line_num += 1 #counter to keep track of line (solely for testing purposes)
#if the script can't find the data file, notify user and terminate
except IOError:
    print("\n[ERROR]: Cannot read data file, check file name and try again.")
    sys.exit()

print("\nFound %s matches for [source: %s] and [destination: %s]:\n" % (len(occurrences), source, destination))

if len(occurrences) != 0:
    #if there are no occurrences, there aren't any times to show! so don't print this line
    print("Time between adjacent connections:\n")

for i in range(len(occurrences)):
    if i == 0:
        continue #if it is the first slot in the array, continue to next slot (can't subtract from array[0-1] slot)
    else:
        #find difference in datetime objects (returns difference in timedelta object)
        difference = (occurrences[i-1]-occurrences[i])
        #for displaying line numbers
        time1 = line_array[i-1]
        time2 = line_array[i]
        #convert timedelta object to microseconds for computing average
        time_m = timedelta_to_microtime(difference)
        #add current microseconds to existing microseconds
        total += time_m
        print("Line %s and Line %s: %s" % (time1, time2, difference))

#check to make sure there are things to take the average of
if len(occurrences) != 0:
    #compute average
    #line read as: total divided by the length of the occurrences array as a float
    #minus 1, divided by 1,000,000 (to convert microseconds back into seconds)
    avg = (total / float((len(occurrences)-1)))/1000000
    print("\nAverage: %s seconds" % (avg))

score 1 · Accepted Answer

You can solve this much more easily with a high-level library such as pandas. Let me demonstrate:

Suppose you have the following data saved in file.csv:

2013-07-18 04:54:15.871 UDP 172.12.332.11:20547 172.12.332.11:20547 -> 172.56.213.80:53 CREATE Ignore 0
2013-07-18 04:54:15.841 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
2013-07-18 04:54:15.831 TCP 172.12.332.11:42547 172.12.332.11:42547 -> 172.56.213.80:53 CREATE Ignore 0
2013-07-18 04:54:15.821 UDP 192.33.230.81:37192 192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0
2013-07-18 04:54:15.811 TCP 172.12.332.11:42547 172.12.332.11:42547 -> 172.56.213.80:53 CREATE Ignore 0

First we read it into a DataFrame:

>>> import pandas as pd
>>> df = pd.read_table('file.csv', sep=' ', header=None, parse_dates=[[0,1]])
>>> print(df.to_string())
                         0_1    2                    3                    4   5                 6       7       8  9
0 2013-07-18 04:54:15.871000  UDP  172.12.332.11:20547  172.12.332.11:20547  ->  172.56.213.80:53  CREATE  Ignore  0
1 2013-07-18 04:54:15.841000  UDP  192.33.230.81:37192  192.81.130.82:37192  ->  172.81.123.70:53  CREATE  Ignore  0
2 2013-07-18 04:54:15.831000  TCP  172.12.332.11:42547  172.12.332.11:42547  ->  172.56.213.80:53  CREATE  Ignore  0
3 2013-07-18 04:54:15.821000  UDP  192.33.230.81:37192  192.81.130.82:37192  ->  172.81.123.70:53  CREATE  Ignore  0
4 2013-07-18 04:54:15.811000  TCP  172.12.332.11:42547  172.12.332.11:42547  ->  172.56.213.80:53  CREATE  Ignore  0

We only need columns 0_1, 4 and 6:

>>> df = df[['0_1', 4, 6]]
>>> print(df.to_string())
                         0_1                    4                 6
0 2013-07-18 04:54:15.871000  172.12.332.11:20547  172.56.213.80:53
1 2013-07-18 04:54:15.841000  192.81.130.82:37192  172.81.123.70:53
2 2013-07-18 04:54:15.831000  172.12.332.11:42547  172.56.213.80:53
3 2013-07-18 04:54:15.821000  192.81.130.82:37192  172.81.123.70:53
4 2013-07-18 04:54:15.811000  172.12.332.11:42547  172.56.213.80:53

Then we should clean up the IP addresses by removing the ports:

>>> df[4] = df[4].str.split(':').str.get(0)
>>> df[6] = df[6].str.split(':').str.get(0)
>>> print(df.to_string())
                         0_1              4              6
0 2013-07-18 04:54:15.871000  172.12.332.11  172.56.213.80
1 2013-07-18 04:54:15.841000  192.81.130.82  172.81.123.70
2 2013-07-18 04:54:15.831000  172.12.332.11  172.56.213.80
3 2013-07-18 04:54:15.821000  192.81.130.82  172.81.123.70
4 2013-07-18 04:54:15.811000  172.12.332.11  172.56.213.80
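
For comparison, the same three values can be pulled out of a single line with only the standard library. This is a rough sketch rather than part of the pandas session above; `parse_line` is a hypothetical helper name, and it assumes every line has the same whitespace-separated layout as the sample data.

```python
from datetime import datetime

def parse_line(line):
    """Return (timestamp, source_ip, destination_ip) for one log line."""
    # split() with no argument splits on any whitespace, including a newline
    # embedded in the middle of a record.
    fields = line.split()
    # fields: date, time, protocol, address, address, '->', destination, ...
    timestamp = datetime.strptime(fields[0] + " " + fields[1], "%Y-%m-%d %H:%M:%S.%f")
    source_ip = fields[4].split(":")[0]       # same column as df[4] above
    destination_ip = fields[6].split(":")[0]  # same column as df[6] above
    return timestamp, source_ip, destination_ip

sample = ("2013-07-18 04:54:15.841 UDP 192.33.230.81:37192 "
          "192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0")
parsed = parse_line(sample)
print(parsed)
```

The field positions 4 and 6 are the same ones kept in the DataFrame, so the two approaches should agree on which address counts as the source.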

Suppose you are interested in source address 172.12.332.11 and destination 172.56.213.80. We filter for those rows:

>>> filtered = df[(df[4] == '172.12.332.11') & (df[6] == '172.56.213.80')]
>>> print(filtered.to_string())
                         0_1              4              6
0 2013-07-18 04:54:15.871000  172.12.332.11  172.56.213.80
2 2013-07-18 04:54:15.831000  172.12.332.11  172.56.213.80
4 2013-07-18 04:54:15.811000  172.12.332.11  172.56.213.80
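
Note that the mask above tests both columns explicitly. The check in the question, `if source and destination in line`, does not: Python reads it as `source and (destination in line)`, so any non-empty source string passes. A small illustration, using one of the sample lines and a source address that does not occur in it:

```python
line = ("2013-07-18 04:54:15.841 UDP 192.33.230.81:37192 "
        "192.81.130.82:37192 -> 172.81.123.70:53 CREATE Ignore 0")
source = "172.12.332.11"       # not present in this line
destination = "172.81.123.70"  # present in this line

# Parsed as: source and (destination in line). A non-empty string is truthy,
# so the source address is never actually compared.
loose_match = bool(source and destination in line)

# Explicit test of both addresses, analogous to the two-sided pandas mask.
strict_match = source in line and destination in line

print(loose_match, strict_match)  # True False
```

Even the explicit version is only a substring test, so "172.12.332.1" would also match a line containing "172.12.332.11". Comparing the parsed address columns for equality, as the pandas filter does, avoids that.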

Now we need to compute the differences between the timestamps:

>>> timestamps = filtered['0_1']
>>> diffs = (timestamps.shift() - timestamps).dropna()
>>> print(diffs.to_string())
2   00:00:00.040000
4   00:00:00.020000
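
If it helps to see what `shift()` is doing here, the subtraction pairs each timestamp with the one before it. A plain-Python sketch of the same pairing, using the three filtered timestamps from above (the variable names are only for illustration):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"
timestamps = [datetime.strptime(s, FMT) for s in (
    "2013-07-18 04:54:15.871",
    "2013-07-18 04:54:15.831",
    "2013-07-18 04:54:15.811",
)]

# Pair each timestamp with its successor and compute previous - current,
# which is what timestamps.shift() - timestamps does row by row.
diffs = [prev - cur for prev, cur in zip(timestamps, timestamps[1:])]

for d in diffs:
    print(d.total_seconds())  # 0.04, then 0.02
```

Because the log is ordered newest first, previous minus current gives positive durations. With data sorted oldest first, the operands would need to be swapped.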

We can now compute any statistics we want:

>>> diffs.mean() # this is in nanoseconds
30000000.0
>>> diffs.std()
14142135.62373095
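
The figures above are in nanoseconds, so 30000000.0 corresponds to 0.03 s. Depending on the pandas version, `diffs.mean()` may come back as a `Timedelta` object rather than a bare number, so one option is to convert to seconds first. A minimal sketch, assuming a pandas version that provides the `.dt.total_seconds()` accessor on timedelta Series:

```python
import pandas as pd

timestamps = pd.Series(pd.to_datetime([
    "2013-07-18 04:54:15.871",
    "2013-07-18 04:54:15.831",
    "2013-07-18 04:54:15.811",
]))

# Same differencing as above: previous row minus current row.
diffs = (timestamps.shift() - timestamps).dropna()

# Convert each timedelta to a float number of seconds before aggregating.
seconds = diffs.dt.total_seconds()
mean_s = seconds.mean()
std_s = seconds.std()  # sample standard deviation (ddof=1)

print("mean: %.6f s, std: %.6f s" % (mean_s, std_s))
```

This should give roughly 0.03 s for the mean and 0.0141 s for the standard deviation, matching the nanosecond values shown above.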

Edit: for the data you sent me

import io
import pandas as pd

def load_dataframe(filename):
    # First read the data as a regular CSV file and extract the _raw column values
    values = pd.read_csv(filename)['_raw'].values
    # Clean up the values: replace embedded newline characters with spaces
    values = [value.replace('\n', ' ') for value in values]
    # Add them to an in-memory text stream
    s = io.StringIO(u'\n'.join(values))
    # From here on everything is the same as before, just read from the stream
    df = pd.read_table(s, sep=r'\s+', header=None, parse_dates=[[0, 1]])[['0_1', 4, 6]]
    df[4] = df[4].str.split(':').str.get(0)
    df[6] = df[6].str.split(':').str.get(0)
    return df

def get_diffs(df, source, destination):
    # Keep only the timestamps of rows matching both addresses
    timestamps = df[(df[4] == source) & (df[6] == destination)]['0_1']
    # Previous matching row minus current matching row
    return (timestamps.shift() - timestamps).dropna()


def main():
    filename = input('Enter filename: ')
    df = load_dataframe(filename)
    while True:
        source = input('Enter source IP: ').strip()
        destination = input('Enter destination IP: ').strip()
        diffs = get_diffs(df, source, destination)
        # Row numbers count matching rows only, starting at 1
        for i, row in enumerate(diffs):
            print('row %d - row %d = %s' % (i+1, i+2, row))
        print('Mean: %s' % diffs.mean())
        print('Std: %s' % diffs.std())
        yn = input('Again? [y/n]: ').lower().strip()
        if yn != 'y':
            return

if __name__ == '__main__':
    main()

Example usage:

$ python test.py
Enter filename: Data.csv
Enter source IP: 172.16.122.21
Enter destination IP: 172.55.102.107
Mean: 3333333.33333
Std: 5773502.6919
Again? [y/n]: n
