I am trying to parse a very large file (> 4 GB) containing WHOIS information.
I only need a subset of the information it contains.
The goal is to output a few WHOIS fields of interest in JSON format. Here is an excerpt of the file:
#
# The contents of this file are subject to
# RIPE Database Terms and Conditions
#
# http://www.ripe.net/db/support/db-terms-conditions.pdf
#
inetnum: 10.16.151.184 - 10.16.151.191
netname: NETECONOMY-MG41731 ENTRY 1
descr: DUMMY FOO ENTRY 1
country: IT ENTRY 1
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: neteconomy.rete@example.com
mnt-by: INTERB-MNT
changed: unread@xxx..net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
% Tags relating to '80.16.151.184 - 80.16.151.191'
% RIPE-USER-RESOURCE
inetnum: 20.16.151.180 - 20.16.151.183
netname: NETECONOMY-MG41731 ENTRY 2
descr: DUMMY FOO ENTRY 2
country: IT ENTRY 2
admin-c: DUMY-RIPE
tech-c: DUMY-RIPE
status: ASSIGNED PA
notify: neteconomy.rete@xxx.it
mnt-by: INTERB-MNT
changed: unread@xxx.net 20000101
source: RIPE
remarks: ****************************
remarks: * THIS OBJECT IS MODIFIED
remarks: * Please note that all data that is generally regarded as personal
remarks: * data has been removed from this object.
remarks: * To view the original object, please query the RIPE Database at:
remarks: * http://www.ripe.net/whois
remarks: ****************************
I am using the code below to parse the file and extract the information. I am sure it is far from optimized and that similar results could be achieved much more efficiently.
import re
import json

RIPE_DB = "ripe.db"  # path to the > 4 GB WHOIS dump

def create_json2():
    regex_inetnum = r'inetnum:\s+(?P<inetnum_val>.*)'
    regex_netname = r'netname:\s+(?P<netname_val>.*)'
    regex_country = r'country:\s+(?P<country_val>.*)'
    regex_descr = r'descr:\s+(?P<descr_val>.*)'
    inetnum_list = []
    netname_list = []
    country_list = []
    descr_list = []
    records = []
    with open(RIPE_DB, "r") as f:
        for line in f:
            inetnum = re.search(regex_inetnum, line, re.IGNORECASE)
            netname = re.search(regex_netname, line, re.IGNORECASE)
            country = re.search(regex_country, line, re.IGNORECASE)
            descr = re.search(regex_descr, line, re.IGNORECASE)
            if inetnum is not None:
                inetnum_val = inetnum.group("inetnum_val").strip()
                inetnum_list.append(inetnum_val)
            if netname is not None:
                netname_val = netname.group("netname_val").strip()
                netname_list.append(netname_val)
            if country is not None:
                country_val = country.group("country_val").strip()
                country_list.append(country_val)
            if descr is not None:
                descr_val = descr.group("descr_val").strip()
                descr_list.append(descr_val)
    for i, n, d, c in zip(inetnum_list, netname_list, descr_list, country_list):
        data = {'inetnum': i, 'netname': n.upper(), 'descr': d.upper(), 'country': c.upper()}
        records.append(data)
    print(json.dumps(records, indent=4))

create_json2()
When I start parsing the file, it stops after a while with the following error:
$> ./parse.py
Killed
RAM/CPU load is very high while the file is being processed.
The same code works as expected, with no errors, on smaller files.
Do you have any suggestions for parsing this 4 GB+ file and for improving the logic and quality of the code?
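One direction that could be sketched, under the assumption that every object in the dump starts with an `inetnum:` line: instead of accumulating four parallel lists for the whole file (which is what exhausts memory and gets the process OOM-killed), build one record at a time and emit it before reading the next. The `RIPE_DB` path and the helper names (`iter_records`, `create_json_stream`) below are placeholders, not part of the original code.

```python
import json
import re

RIPE_DB = "ripe.db"  # placeholder path to the > 4 GB dump

# Match only the fields of interest, with one regex compiled once
# instead of four searches per line.
FIELD_RE = re.compile(r'^(inetnum|netname|descr|country):\s+(.*)$', re.IGNORECASE)

def iter_records(path):
    """Yield one dict per 'inetnum' block, reading the file line by line.

    A new record starts whenever an 'inetnum:' line is seen, so only the
    current record is held in memory at any time.
    """
    record = None
    with open(path, "r") as f:
        for line in f:
            m = FIELD_RE.match(line)
            if not m:
                continue
            key, value = m.group(1).lower(), m.group(2).strip()
            if key == "inetnum":
                if record:
                    yield record
                record = {"inetnum": value}
            elif record is not None and key not in record:
                # keep the first occurrence, uppercased as in the original code
                record[key] = value.upper()
        if record:
            yield record

def create_json_stream(in_path, out_path):
    # Write one JSON object per line so the output never has to fit in RAM.
    with open(out_path, "w") as out:
        for rec in iter_records(in_path):
            out.write(json.dumps(rec) + "\n")
```

Memory use then stays roughly constant regardless of file size, since neither the input lists nor the full JSON document are ever materialized. The trade-off is JSON Lines output instead of one indented array; if a single array is required, the records can still be written incrementally between `[` and `]` separators.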