python - Processing a large amount of data with a small Python script

Question

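Hello, I have a small Python script whose job is to read data from a TXT file, sort it in a specific way, remove duplicates and meaningless data, and write the result to another TXT file in this format: MAC, IP, number, device.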

import re
f = open('frame.txt', 'r')
d = open('Result1.txt', 'w')
mac=""
ip=""
phoneName=""
phoneTel=""
lmac=""
lip=""
lphoneName=""
lphoneTel=""
lines=f.readlines()
s=0
p=0
for line in lines:
    matchObj = re.search( '(?<=Src: )[0-9a-z]{2}:[0-9a-z]{2}:[0-9a-z]{2}:[0-9a-z]{2}:[0-9a-z]{2}:[0-9a-z]{2}', line, re.M|re.I)
    if(matchObj):
            mac=matchObj.group(0)+"\t"
    matchObj = re.search( r'(?<=Src: )([0-9]+)(?:\.[0-9]+){3}', line, re.M|re.I)
    if(matchObj):
            ip=matchObj.group(0)+"\t"
    if(s==1):
        s=0
        matchObj = re.search( r'(?<=Value: )\d+',line,re.M|re.I)
        if(matchObj):
            phoneName=matchObj.group(0)+"\t"
    if(p==1):
        p=0
        matchObj = re.search( '(?<=Value: ).+',line,re.M|re.I)
        if(matchObj):
            phoneTel=matchObj.group(0)+"\t"  
    matchObj = re.search( r'(?<=Key: user \(218)', line, re.M|re.I)
    if(matchObj):
        s=1
    matchObj = re.search( r'(?<=Key: resource \(165)', line, re.M|re.I)
    if(matchObj):
        p=1
    if(mac!="" and ip!="" and phoneName!="" and phoneTel!="" and mac!=lmac and ip!=lip and phoneName!=lphoneName and phoneTel!=lphoneTel):        
        d.write(mac+" " +ip+" "+ phoneName+" "+ phoneTel)
        lmac=mac
        lip=ip
        lphoneName=phoneName
        lphoneTel=phoneTel
        d.write("\n")
    matchObj = re.search( r'Frame \d+', line, re.M|re.I)
    if(matchObj):              
        mac=""
        ip=""
        phoneName=""
        phoneTel=""        
d.close()
f.close()

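The problem is that this code has to process a large amount of data, possibly up to 100 GB. When I run it on a file that size, the program freezes and then gets killed. Any ideas on how to solve this? Thanks a lot!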

score 5 · Accepted Answer

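You are reading the whole file right at the start. If the file is that large, loading it all into memory is going to be a problem. Try iterating over the lines instead. In general, you would do something like: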

with open(filename) as f:
    for line in f:
        # This iterates over the lines in the file one at a time rather than reading them all at once
        pass  # handle `line` here

So in your case, change the loop construct to:

for line in f:

and remove:

lines=f.readlines()
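
To make the change concrete, here is a rough sketch of how the script from the question might look once it streams the file. The regular expressions and the "every field differs from the last written record" check are taken from the question. The names `iter_records` and `convert` are hypothetical helpers I made up for illustration, and the plain tab-separated output is a simplification of the original spacing, so treat it as a starting point rather than a drop-in replacement.

```python
import re

# Patterns taken from the question, compiled once instead of on every line.
MAC_RE = re.compile(r'(?<=Src: )[0-9a-z]{2}(?::[0-9a-z]{2}){5}', re.I)
IP_RE = re.compile(r'(?<=Src: )[0-9]+(?:\.[0-9]+){3}')
USER_KEY_RE = re.compile(r'Key: user \(218', re.I)
RESOURCE_KEY_RE = re.compile(r'Key: resource \(165', re.I)
NAME_RE = re.compile(r'(?<=Value: )\d+')
TEL_RE = re.compile(r'(?<=Value: ).+')
FRAME_RE = re.compile(r'Frame \d+', re.I)


def iter_records(lines):
    """Yield (mac, ip, name, tel) tuples from any iterable of lines.

    Hypothetical helper: only the current line and a few short strings
    are kept in memory, never the whole input.
    """
    mac = ip = name = tel = ""
    last = ("", "", "", "")
    expect_name = expect_tel = False
    for line in lines:
        m = MAC_RE.search(line)
        if m:
            mac = m.group(0)
        m = IP_RE.search(line)
        if m:
            ip = m.group(0)
        if expect_name:  # previous line was "Key: user (218..."
            expect_name = False
            m = NAME_RE.search(line)
            if m:
                name = m.group(0)
        if expect_tel:  # previous line was "Key: resource (165..."
            expect_tel = False
            m = TEL_RE.search(line)
            if m:
                tel = m.group(0)
        if USER_KEY_RE.search(line):
            expect_name = True
        if RESOURCE_KEY_RE.search(line):
            expect_tel = True
        record = (mac, ip, name, tel)
        # Same condition as the question: all fields set, and every field
        # differs from the record that was written last.
        if all(record) and all(new != old for new, old in zip(record, last)):
            yield record
            last = record
        if FRAME_RE.search(line):  # a new frame starts: reset the fields
            mac = ip = name = tel = ""


def convert(src_path, dst_path):
    """Stream src_path line by line and write one record per line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for record in iter_records(src):  # the file object yields lines lazily
            dst.write("\t".join(record) + "\n")
```

With this in place the script would be run as `convert('frame.txt', 'Result1.txt')`. Because `iter_records` accepts any iterable of lines, it can also be tried on a small list of strings before pointing it at the 100 GB file.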

score 1 · Accepted Answer

Use readline() instead of readlines().

readlines() reads the entire file into memory at once, whereas readline() returns a single line per call.
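
A minimal sketch of what a readline() loop looks like, in case it helps. The function name `count_lines` is just a made-up example; the point is that readline() returns an empty string at end of file, which is what ends the loop.

```python
def count_lines(f):
    """Count lines in an open text file using readline() only."""
    count = 0
    while True:
        line = f.readline()
        if not line:  # '' means end of file; a blank line is '\n', not ''
            break
        count += 1  # replace this with the per-line processing
    return count
```

For example, `count_lines(open('frame.txt'))` would walk the file while holding only one line in memory at a time.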

answered 2013-07-18T23:47:40.333
