
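I am trying to parse a very large file (> 4 GB) containing WHOIS information.

I only need a subset of the information in the file.

The goal is to output a few WHOIS fields of interest in JSON format.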

#
# The contents of this file are subject to
# RIPE Database Terms and Conditions
#
# http://www.ripe.net/db/support/db-terms-conditions.pdf
#

inetnum:        10.16.151.184 - 10.16.151.191
netname:        NETECONOMY-MG41731 ENTRY 1
descr:          DUMMY FOO ENTRY 1
country:        IT ENTRY 1
admin-c:        DUMY-RIPE
tech-c:         DUMY-RIPE
status:         ASSIGNED PA
notify:         neteconomy.rete@example.com
mnt-by:         INTERB-MNT
changed:        unread@xxx..net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

% Tags relating to '80.16.151.184 - 80.16.151.191'
% RIPE-USER-RESOURCE

inetnum:        20.16.151.180 - 20.16.151.183
netname:        NETECONOMY-MG41731 ENTRY 2
descr:          DUMMY FOO ENTRY 2
country:        IT ENTRY 2
admin-c:        DUMY-RIPE
tech-c:         DUMY-RIPE
status:         ASSIGNED PA
notify:         neteconomy.rete@xxx.it
mnt-by:         INTERB-MNT
changed:        unread@xxx.net 20000101
source:         RIPE
remarks:        ****************************
remarks:        * THIS OBJECT IS MODIFIED
remarks:        * Please note that all data that is generally regarded as personal
remarks:        * data has been removed from this object.
remarks:        * To view the original object, please query the RIPE Database at:
remarks:        * http://www.ripe.net/whois
remarks:        ****************************

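I am using the code below to parse the file and extract the information. I am sure it is far from optimized and that a similar result can be achieved more efficiently.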

def create_json2():
    regex_inetnum = r'inetnum:\s+(?P<inetnum_val>.*)'
    regex_netname = r'netname:\s+(?P<netname_val>.*)'
    regex_country = r'country:\s+(?P<country_val>.*)'
    regex_descr = r'descr:\s+(?P<descr_val>.*)'
    inetnum_list = []
    netname_list = []
    country_list = []
    descr_list = []
    records = []
    with open(RIPE_DB, "r") as f:
        for line in f:
            inetnum = re.search(regex_inetnum, line, re.IGNORECASE)
            netname = re.search(regex_netname, line, re.IGNORECASE)
            country = re.search(regex_country, line, re.IGNORECASE)
            descr = re.search(regex_descr, line, re.IGNORECASE)
            if inetnum is not None:
                inetnum_val = inetnum.group("inetnum_val").strip()
                inetnum_list.append(inetnum_val)
            if netname is not None:
                netname_val = netname.group("netname_val").strip()
                netname_list.append(netname_val)
            if country is not None:
                country_val = country.group("country_val").strip()
                country_list.append(country_val)
            if descr is not None:
                descr_val = descr.group("descr_val").strip()
                descr_list.append(descr_val)

        for i,n,d,c in zip(inetnum_list, netname_list, descr_list, country_list):
            data = {'inetnum': i, 'netname': n.upper(), 'descr': d.upper(), 'country': c.upper()}
            records.append(data)   
    print json.dumps(records, indent=4)

create_json2()

When I start parsing the file, it stops after a while with the following error.

$> ./parse.py
Killed

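RAM and CPU load are very high while the file is being processed.

The same code works as expected, without errors, on smaller files.

Do you have any suggestions for parsing this > 4 GB file and for improving the logic and quality of the code?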


1 Answer


The magic word is "flush": you need to get that data out of Python as soon as possible (preferably in batches).

#!/usr/bin/env python

import shelve

db = shelve.open('ipnum.db')

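def split_line(line):
    # Split on the first colon only, so values that contain ':' (such as URLs) stay intact.
    key, _, value = line.partition(':')
    return key.strip(), value.strip()

def parse_entry(f):
    # Read attribute lines from the open file until the blank line that ends the object.
    entry = {}
    for line in f:
        line = line.strip()
        if len(line) < 5:
            break

        key, value = split_line(line)
        if key not in entry:
            entry[key] = value
        else:
            # Repeated keys (e.g. several 'remarks' lines) are collected into a list.
            if not isinstance(entry[key], list):
                entry[key] = [entry[key]]
            entry[key].append(value)

    return entry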

def parse_file(file_path):
    i = 0
    with open(file_path) as f:
        for line in f:
            if line.startswith('inetnum'):
                inetnum = split_line(line)[1]
                entry = parse_entry(f)
                db[inetnum] = entry

                if i == 250000:
                    print 'done with 250k'
                    db.sync()
                    i = 0

                i += 1

    db.close()

if __name__ == '__main__':
    parse_file('ripe.db.inetnum')

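This script saves the entire database to a database file named ipnum.db; you can easily change the output destination as well as the flush frequency.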
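If the output destination should be JSON rather than a shelf, one possible variation is to stream one JSON object per line, so that no list of records ever builds up in memory. This is only a rough sketch, not part of the script above: the names `WANTED`, `iter_records` and `write_json_lines` are made up for illustration, the field list is taken from the question, and it assumes that objects are separated by blank lines and that only the first value of a repeated field is wanted.

```python
import json

# Fields of interest, as listed in the question (an assumption of this sketch).
WANTED = ('inetnum', 'netname', 'descr', 'country')

def iter_records(lines):
    """Yield one dict per WHOIS object, keeping only the WANTED fields."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line ends the current object.
            if 'inetnum' in record:
                yield record
            record = {}
            continue
        if line[0] in '#%':
            # Skip header comments and tag lines.
            continue
        key, _, value = line.partition(':')
        key = key.strip().lower()
        # Keep only the first value of each wanted field.
        if key in WANTED and key not in record:
            record[key] = value.strip()
    # The last object may not be followed by a blank line.
    if 'inetnum' in record:
        yield record

def write_json_lines(lines, out):
    """Write each record as one JSON object per line and return the count."""
    count = 0
    for record in iter_records(lines):
        out.write(json.dumps(record) + '\n')
        count += 1
    return count
```

It could be called with two open files, for example the input `ripe.db.inetnum` and a hypothetical output file such as `inetnum.jsonl`. Because both functions handle one record at a time, memory use should stay roughly constant regardless of file size.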

The db.sync() call is mostly for show, since bsddb flushes automatically at these data volumes.
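Once the shelf has been written, the fields of interest can be pulled back out one entry at a time and emitted as JSON. The following is a minimal sketch under a few assumptions: it targets Python 3, where iterating over a shelf yields its keys lazily; `dump_shelf` and its `fields` parameter are hypothetical names; and each entry is assumed to be the dict produced by `parse_entry` above, keyed by its inetnum range.

```python
import json
import shelve

def dump_shelf(db_path, out, fields=('netname', 'descr', 'country')):
    """Stream entries from a shelf to `out` as JSON Lines; return the count."""
    count = 0
    db = shelve.open(db_path, flag='r')  # open read-only
    try:
        for inetnum in db:
            entry = db[inetnum]
            record = {'inetnum': inetnum}
            for field in fields:
                # Repeated attributes were stored as lists and are written as JSON arrays.
                if field in entry:
                    record[field] = entry[field]
            out.write(json.dumps(record) + '\n')
            count += 1
    finally:
        db.close()
    return count
```

Only one entry is unpickled at a time, so this step should not bring back the memory pressure that the original script ran into.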

Answered 2013-09-05T11:51:42.743