
Write a program that opens a tcpdump file and reorders the dumped lines so that the packets from each session are grouped together. Each session is written out to its own file, with a unique name generated from that session's IP addresses and port numbers.

Sample tcpdump.txt:

13:36:21.808234 IP 142.55.112.172.1692 > 142.55.1.9.80: Flags [P.], seq 111310335:111310775, ack 1980466801, win 64427, length 440
13:36:21.811651 IP 142.55.1.9.80 > 142.55.117.173.3783: Flags [.], seq 2006591246:2006592626, ack 850049956, win 33120, length 1380
13:36:21.811904 IP 142.55.1.9.80 > 142.55.117.173.3783: Flags [.], seq 1380:2760, ack 1, win 33120, length 1380
13:36:21.812016 IP 142.55.1.9.80 > 142.55.117.173.3783: Flags [P.], seq 2760:4096, ack 1, win 33120, length 1336
13:36:21.812278 IP 142.55.1.9.80 > 142.55.117.173.3783: Flags [.], seq 4096:5476, ack 1, win 33120, length 1380
13:36:21.812413 IP 142.55.117.173.3783 > 142.55.1.9.80: Flags [.], ack 4096, win 65535, length 0
13:36:21.812538 IP 142.55.1.9.80 > 142.55.117.173.3783: Flags [.], seq 5476:6856, ack 1, win 33120, length 1380
13:36:21.812876 IP 142.55.117.173.3783 > 142.55.1.9.80: Flags [.], ack 6856, win 65535, length 0
13:36:21.813234 IP 142.55.1.9.80 > 142.55.117.173.3783: Flags [.], seq 6856:8236, ack 1, win 33120, length 1380
13:36:21.813358 IP 142.55.1.9.80 > 142.55.117.173.3783: Flags [.], seq 8236:9616, ack 1, win 33120, length 1380
13:36:21.813396 IP 142.55.117.187.4080 > 142.55.1.9.80: Flags [P.], seq 1883704283:1883704589, ack 2004811294, win 65535, length 306
13:36:21.813610 IP 142.55.1.9.80 > 142.55.117.173.3783: Flags [P.], seq 9616:10599, ack 1, win 33120, length 983
13:36:21.813940 IP 142.55.117.173.3783 > 142.55.1.9.80: Flags [.], ack 9616, win 65535, length 0

Here is what I have so far:

import re 

read_file = open('tcpdump.txt', 'r')

source_ip = " "
dest_ip = " "
source_port = " "
dest_port = " "


def four_tuple(line):
    _search_ = re.compile(r'(\d*\.\d*.\d*.\d*)(\.\d*) > (\d*\.\d*.\d*.\d*)(\.\d*)')

    source_ip = _search_.search(line).group(1)
    source_port = _search_.search(line).group(2)

    dest_ip = _search_.search(line).group(3)
    dest_port = _search_.search(line).group(4)

    print('The Source IP and Port are:', source_ip, source_port)
    print('The Destination IP and Port are:', dest_ip, dest_port)

for read_lines in read_file:
    read_file.readline()
    four_tuple(read_lines)

Sample output so far:

The Source IP and Port are: 142.55.112.172 .1692
The Destination IP and Port are: 142.55.1.9 .80
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3783
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3783
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3783
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3783
The Source IP and Port are: 142.55.117.187 .4080
The Destination IP and Port are: 142.55.1.9 .80

Now, how do I group all of the repeated IP addresses together into one cluster so that they are not repeated again? Something like this would be the ideal output:

The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3783
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3783
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3783
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3783
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3784
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3784
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3784
The Source IP and Port are: 142.55.1.9 .80
The Destination IP and Port are: 142.55.117.173 .3784

1 Answer


The best way in Python is to use itertools.groupby. I have rewritten your example below, not to appear clever, but to point out a few ways your script could be better, or more pythonic:

from __future__ import with_statement  # Only needed on very old python versions

import os
import re
import sys 
import itertools


path_to_file = 'tcpdump.txt'
line_parser = re.compile(r'(\d*\.\d*.\d*.\d*)(\.\d*) > (\d*\.\d*.\d*.\d*)(\.\d*)')


def parse_line(line):  # renamed from four_tuple
    match = line_parser.search(line)
    if match:
        source_ip, source_port, dest_ip, dest_port = match.groups()  # Use tuple unpacking to get the variables.
        return {  # A dictionary where each item is given its own key: value pair.
            u'source_ip': source_ip, u'source_port': source_port,
            u'dest_ip': dest_ip, u'dest_port': dest_port,
            u'line': line,
        }
    else:
        return None  # We'll check for a bad match later!

def main():
    parsed_results = []  # We need a place to store our parsed_results from your matcher function.
    if not os.path.exists(path_to_file):  # Bail out if we cannot see the file
        print(u'Error, cannot load file: {0}'.format(path_to_file))
        sys.exit(1)
    else:
        with open(path_to_file, 'r') as src_file:  # Using the with context, open that file. It will be auto closed when we're done.
            for src_line in src_file:  # for each line in the file
                results = parse_line(src_line)  # run it through your matcher function.
                if results:
                    parsed_results.append(results)  # Place the results somewhere for later grouping and reporting
                else:
                    print(u'[WARNING] Unable to find results in line: "{0}"'.format(repr(src_line)))  # Show us the line, without interpreting that line at all.
    # By now, we're done with the src_file so it's auto-closed because of the with statement
    if parsed_results:  # If we have any results to process
        # Sort those results by source_ip
        sort_func = lambda x: x.get(u'source_ip')  # We will sort our dictionary items by the source_ip key we extracted earlier in parse_line
        # First, we have to sort our results so source_ip's are joined together.
        sorted_results = sorted(parsed_results, key = sort_func)
        # Now, we group by source_ip using the same sort_func
        grouped_results = itertools.groupby(sorted_results, key = sort_func)
        # Here, grouped_results will return a generator which will yield results as we go over it
        #   However, it will only yield results once. You cannot continually iterate over it like a list.
        for source_ip, results in grouped_results:  # The iterator yields the grouped key, then a generator of results on each iteration
            for result in results:  # So, for each result which was previously grouped.
                print(u'The Source IP and Port are: {0} {1}'.format(result.get(u'source_ip'), result.get(u'source_port')))
                print(u'The Destination IP and Port are: {0} {1}'.format(result.get(u'dest_ip'), result.get(u'dest_port')))

if __name__ == u'__main__':  # This will only execute if this script is run from the command line, and not when it's loaded as a module.
    main()
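
As a side note on the sort-then-group step: itertools.groupby only merges consecutive items that share a key, which is why the results are sorted first. Here is a toy illustration (made-up addresses, not taken from your capture):

import itertools

ips = ['142.55.1.9', '142.55.112.172', '142.55.1.9']

# Without the sort, groupby would report three groups here, because the two
# '142.55.1.9' entries are not next to each other.
for ip, group in itertools.groupby(sorted(ips)):
    print('{0}: {1} packet(s)'.format(ip, len(list(group))))

# 142.55.1.9: 2 packet(s)
# 142.55.112.172: 1 packet(s)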

A few things that I did differently:

  1. I used the with statement. Learn to love it, it's your friend here.
  2. Define the re.compile outside of the function. This way, it's compiled once, and simply re-used each time as opposed to being re-created each loop.
  3. There's actually no need for read_file.readline(), as read_file is already iterable. In fact, that extra readline() call consumes a line behind the for loop's back, which is why your sample output only covers every other packet. See the short snippet after this list.
  4. source_ip, source_port and others can be assigned in one line using tuple unpacking.
  5. The use of the if __name__ == u'__main__': check. This allows the same script to be used as a module and loaded by other python scripts, and as an executable script on the command line. For more details, see What does if __name__ == "__main__": do?.
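
To make points 3 and 4 concrete, here is the core of a parsing loop written that way, reusing your own regular expression (the print format is just for illustration):

import re

line_parser = re.compile(r'(\d*\.\d*.\d*.\d*)(\.\d*) > (\d*\.\d*.\d*.\d*)(\.\d*)')

with open('tcpdump.txt', 'r') as read_file:
    for line in read_file:  # the file object is already iterable; no readline() needed
        match = line_parser.search(line)
        if match:
            # One search, four variables, assigned in a single line.
            source_ip, source_port, dest_ip, dest_port = match.groups()
            print('{0}{1} > {2}{3}'.format(source_ip, source_port, dest_ip, dest_port))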

Please don't hesitate to ask questions if any of it seems advanced.
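
Finally, since your original assignment also asks for each session to be written to its own file, here is one way you could extend the script above. Treat it as a sketch: it groups on the full pair of endpoints (so both directions of a connection count as one session), and the file-naming scheme is something I made up rather than anything your assignment requires:

import os
import itertools

def write_sessions(parsed_results, out_dir='.'):
    def session_key(result):
        # Sort the two endpoints so that A -> B and B -> A packets
        # end up in the same group, i.e. the same session.
        endpoints = sorted([(result[u'source_ip'], result[u'source_port']),
                            (result[u'dest_ip'], result[u'dest_port'])])
        return tuple(endpoints)

    sorted_results = sorted(parsed_results, key=session_key)
    for key, results in itertools.groupby(sorted_results, key=session_key):
        (ip_a, port_a), (ip_b, port_b) = key
        # Example file name: 142.55.1.9_80-142.55.117.173_3783.txt
        # lstrip('.') drops the leading dot your regex leaves on the port numbers.
        name = '{0}_{1}-{2}_{3}.txt'.format(ip_a, port_a.lstrip('.'),
                                            ip_b, port_b.lstrip('.'))
        with open(os.path.join(out_dir, name), 'w') as out_file:
            for result in results:
                out_file.write(result[u'line'])  # the original dump line, newline included

You could then call write_sessions(parsed_results) at the end of main(), after the printing loop.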

Answered 2013-11-14T03:20:10.170