python - Loading a large file in Python

Question

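I am using Python 2.6.2 [GCC 4.3.3] running on Ubuntu 9.04. I need to read a big data file (~1GB, >3 million lines), line by line, using a Python script.

I tried the methods below and found that they use a very large amount of memory (~3GB)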

for line in open('datafile','r').readlines():
   process(line)

or,

for line in file(datafile):
   process(line)

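Is there a better way to load a large file line by line, say

a) by explicitly stating the maximum number of lines the file may have loaded in memory at any one time? Or
b) by loading it in chunks of a given size (say, 1024 bytes), provided the last line of each chunk is loaded completely and not truncated?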
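
For reference, option (a) can be sketched with itertools.islice. This is only an illustration, written for Python 3: read_in_batches, max_lines and the temporary demo file are made-up names and data, not part of the original script. Note that it only caps how many lines the reader holds at once; it does nothing about memory that process() itself keeps.

```python
import os
import tempfile
from itertools import islice

def read_in_batches(path, max_lines):
    """Yield lists holding at most max_lines lines each."""
    with open(path) as f:
        while True:
            # islice pulls at most max_lines whole lines from the file iterator.
            batch = list(islice(f, max_lines))
            if not batch:
                break
            yield batch

# Tiny demonstration file (hypothetical data, 2500 short lines).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines("line %d\n" % i for i in range(2500))

batch_sizes = [len(batch) for batch in read_in_batches(tmp.name, 1000)]
os.remove(tmp.name)
print(batch_sizes)   # [1000, 1000, 500]
```

Each batch is dropped before the next one is read, so at most max_lines lines are alive in the reader at any time.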

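Several suggestions pointed to the methods I mentioned above and have already tried; I am trying to see whether there is a better way to handle this. My search has not been fruitful so far. I appreciate your help.

p/s I have done some memory profiling using Heapy and found no memory leaks in the Python code I am using.

Update 20 August 2012, 16:41 (GMT+1)

Tried both approaches as suggested by J.F. Sebastian, mgilson and IamChuckB (datafile is a variable)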

with open(datafile) as f:
    for line in f:
        process(line)

and,

import fileinput
for line in fileinput.input([datafile]):
    process(line)

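Strangely, both of them use ~3GB of memory. My data file in this test is 765.2MB and consists of 21,181,079 lines. I see the memory grow over time (in steps of around 40-80MB) before stabilizing at 3GB.

A basic doubt: is it necessary to flush the line after use?

To understand this better, I did memory profiling using Heapy.

Level 1 Profiling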

Partition of a set of 36043 objects. Total size = 5307704 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15934  44  1301016  25   1301016  25 str
     1     50   0   628400  12   1929416  36 dict of __main__.NodeStatistics
     2   7584  21   620936  12   2550352  48 tuple
     3    781   2   590776  11   3141128  59 dict (no owner)
     4     90   0   278640   5   3419768  64 dict of module
     5   2132   6   255840   5   3675608  69 types.CodeType
     6   2059   6   247080   5   3922688  74 function
     7   1716   5   245408   5   4168096  79 list
     8    244   1   218512   4   4386608  83 type
     9    224   1   213632   4   4600240  87 dict of type
<104 more rows. Type e.g. '_.more' to view.>

==============================================================

Level 2 Profiling for Level 1-Index 0

Partition of a set of 15934 objects. Total size = 1301016 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   2132  13   274232  21    274232  21 '.co_code'
     1   2132  13   189832  15    464064  36 '.co_filename'
     2   2024  13   114120   9    578184  44 '.co_lnotab'
     3    247   2   110672   9    688856  53 "['__doc__']"
     4    347   2    92456   7    781312  60 '.func_doc', '[0]'
     5    448   3    27152   2    808464  62 '[1]'
     6    260   2    15040   1    823504  63 '[2]'
     7    201   1    11696   1    835200  64 '[3]'
     8    188   1    11080   1    846280  65 '[0]'
     9    157   1     8904   1    855184  66 '[4]'
<4717 more rows. Type e.g. '_.more' to view.>

Level 2 Profiling for Level 1-Index 1

Partition of a set of 50 objects. Total size = 628400 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0     50 100   628400 100    628400 100 '.__dict__'

Level 2 Profiling for Level 1-Index 2

Partition of a set of 7584 objects. Total size = 620936 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   1995  26   188160  30    188160  30 '.co_names'
     1   2096  28   171072  28    359232  58 '.co_varnames'
     2   2078  27   157608  25    516840  83 '.co_consts'
     3    261   3    21616   3    538456  87 '.__mro__'
     4    331   4    21488   3    559944  90 '.__bases__'
     5    296   4    20216   3    580160  93 '.func_defaults'
     6     55   1     3952   1    584112  94 '.co_freevars'
     7     47   1     3456   1    587568  95 '.co_cellvars'
     8     35   0     2560   0    590128  95 '[0]'
     9     27   0     1952   0    592080  95 '.keys()[0]'
<189 more rows. Type e.g. '_.more' to view.>

Level 2 Profiling for Level 1-Index 3

Partition of a set of 781 objects. Total size = 590776 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1   0    98584  17     98584  17 "['locale_alias']"
     1     29   4    35768   6    134352  23 '[180]'
     2     28   4    34720   6    169072  29 '[90]'
     3     30   4    34512   6    203584  34 '[270]'
     4     27   3    33672   6    237256  40 '[0]'
     5     25   3    26968   5    264224  45 "['data']"
     6      1   0    24856   4    289080  49 "['windows_locale']"
     7     64   8    20224   3    309304  52 "['inters']"
     8     64   8    17920   3    327224  55 "['galog']"
     9     64   8    17920   3    345144  58 "['salog']"
<84 more rows. Type e.g. '_.more' to view.>

==============================================================

Level 3 Profiling for Level 2-Index 0, Level 1-Index 0

Partition of a set of 2132 objects. Total size = 274232 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   2132 100   274232 100    274232 100 '.co_code'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 1

Partition of a set of 50 objects. Total size = 628400 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0     50 100   628400 100    628400 100 '.__dict__'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 2

Partition of a set of 1995 objects. Total size = 188160 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0   1995 100   188160 100    188160 100 '.co_names'

Level 3 Profiling for Level 2-Index 0, Level 1-Index 3

Partition of a set of 1 object. Total size = 98584 bytes.
 Index  Count   %     Size   % Cumulative  % Referred Via:
     0      1 100    98584 100     98584 100 "['locale_alias']"

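Still troubleshooting this.

Do share with me if you have faced this before.

Thanks for your help.

Update 21 August 2012, 01:55 (GMT+1)

mgilson, the process function is used to post-process a Network Simulator 2 (NS2) trace file. Some lines from the trace file are shared below. I use numerous objects, counters, tuples, and dictionaries in the Python script to learn how a wireless network performs.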

s 1.231932886 _25_ AGT  --- 0 exp 10 [0 0 0 0 Y Y] ------- [25:0 0:0 32 0 0] 
s 1.232087886 _25_ MAC  --- 0 ARP 86 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776108 _42_ MAC  --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776625 _34_ MAC  --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776633 _9_ MAC  --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232776658 _0_ MAC  --- 0 ARP 28 [0 ffffffff 67 806 Y Y] ------- [REQUEST 103/25 0/0]
r 1.232856942 _35_ MAC  --- 0 ARP 28 [0 ffffffff 64 806 Y Y] ------- [REQUEST 100/25 0/0]
s 1.232871658 _0_ MAC  --- 0 ARP 86 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
r 1.233096712 _29_ MAC  --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097047 _4_ MAC  --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097050 _26_ MAC  --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233097051 _1_ MAC  --- 0 ARP 28 [0 ffffffff 66 806 Y Y] ------- [REQUEST 102/25 0/0]
r 1.233109522 _25_ MAC  --- 0 ARP 28 [13a 67 1 806 Y Y] ------- [REPLY 1/0 103/25]
s 1.233119522 _25_ MAC  --- 0 ACK 38 [0 1 67 0 Y Y] 
r 1.233236204 _17_ MAC  --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
r 1.233236463 _20_ MAC  --- 0 ARP 28 [0 ffffffff 65 806 Y Y] ------- [REQUEST 101/25 0/0]
D 1.233236694 _18_ MAC  COL 0 ARP 86 [0 ffffffff 65 806 67 1] ------- [REQUEST 101/25 0/0]

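The aim of the 3-level profiling with Heapy is to help me narrow down which object(s) are eating up most of the memory. As you can see, unfortunately I cannot tell which one specifically needs tweaking, as the output is too generic. For example, I know that although "dict of __main__.NodeStatistics" has only 50 objects out of 36043 (0.1%), it takes up 12% of the total memory used to run the script, yet I am unable to find which specific dictionary I would need to look into.
I tried implementing David Eyk's suggestion as below (snippet), manually collecting garbage every 500,000 lines,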

import gc
for i, line in enumerate(file(datafile)):
    # Force a collection every 500,000 lines and report what was freed.
    if i % 500000 == 0:
        print '-----------This is line number', i
        collected = gc.collect()
        print "Garbage collector: collected %d objects." % (collected)

Unfortunately, the memory usage is still at 3GB, and the output (snippet) is as below,

-----------This is line number 0
Garbage collector: collected 0 objects.
-----------This is line number 500000
Garbage collector: collected 0 objects.
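
A note on why the collector may report 0 here: in CPython, gc.collect() only reclaims objects that have become unreachable (in practice, reference cycles), while anything still referenced from live data stays in memory. The sketch below illustrates that distinction with a hypothetical Node class. It is not the poster's code, and it does not by itself prove what was holding the memory in this script.

```python
import gc

class Node(object):
    """Hypothetical object used only for this demonstration."""
    def __init__(self):
        self.other = None

gc.collect()  # start from a clean slate

# Case 1: objects that are still referenced are never collected.
kept = [Node() for _ in range(1000)]
freed_while_referenced = gc.collect()

# Case 2: an unreachable reference cycle is what gc.collect() reclaims.
a, b = Node(), Node()
a.other, b.other = b, a
del a, b
freed_cycle = gc.collect()

print(freed_while_referenced, freed_cycle)
```

The second number is non-zero because the two nodes only referenced each other; the 1000 nodes in kept survive because the list still points at them.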

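Having implemented martineau's suggestion, I see the memory usage is now 22MB, down from the earlier 3GB! Something I had been looking forward to achieving. The strange thing is the below,

I did the same memory profiling as before,

Level 1 Profiling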

Partition of a set of 35474 objects. Total size = 5273376 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15889  45  1283640  24   1283640  24 str
     1     50   0   628400  12   1912040  36 dict of __main__.NodeStatistics
     2   7559  21   617496  12   2529536  48 tuple
     3    781   2   589240  11   3118776  59 dict (no owner)
     4     90   0   278640   5   3397416  64 dict of module
     5   2132   6   255840   5   3653256  69 types.CodeType
     6   2059   6   247080   5   3900336  74 function
     7   1716   5   245408   5   4145744  79 list
     8    244   1   218512   4   4364256  83 type
     9    224   1   213632   4   4577888  87 dict of type
<104 more rows. Type e.g. '_.more' to view.>

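Comparing the previous memory profiling output with the above, str has 45 fewer objects (17376 bytes), tuple has 25 fewer objects (3440 bytes), and dict (no owner), though unchanged in object count, has shrunk by 1536 bytes. All the other objects are the same, including dict of __main__.NodeStatistics. The total number of objects is 35474. The small reduction in objects (0.2%) produced a 99.3% memory saving (22MB from 3GB). Very strange.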
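
The arithmetic behind that comparison can be checked with a few lines. The figures are copied from the two level 1 Heapy tables above; the variable names are only for illustration, and 3GB is taken as 3 * 1024 MB.

```python
# (object count, size in bytes) per kind, from the two level 1 tables.
before = {"str": (15934, 1301016), "tuple": (7584, 620936),
          "dict (no owner)": (781, 590776)}
after = {"str": (15889, 1283640), "tuple": (7559, 617496),
         "dict (no owner)": (781, 589240)}

deltas = {}
for kind in before:
    count_diff = before[kind][0] - after[kind][0]
    size_diff = before[kind][1] - after[kind][1]
    deltas[kind] = (count_diff, size_diff)

heap_saved = 5307704 - 5273376            # Heapy totals, in bytes
process_saving = 1 - 22.0 / (3 * 1024)    # 22MB versus 3GB, as reported

for kind in sorted(deltas):
    print(kind, deltas[kind])
print(heap_saved, round(process_saving * 100, 1))
```

The totals Heapy reports differ by only 34328 bytes between the two runs, while the process-level usage differs by roughly 3GB.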

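As you may notice, although I know where the memory starvation is occurring, I have not yet been able to narrow down what is causing the bleed.

Will continue to investigate this.

Thanks for all the pointers. I am using this opportunity to learn a lot about Python, as I am not an expert. I appreciate the time you have taken to help me.

Update 23 August 2012, 00:01 (GMT+1) -- SOLVED

Following martineau's suggestion, I went on debugging with minimalistic code. I started adding code back into the process function and watched for the memory bleed.
I found that the memory starts to bleed when I add a class like the one below,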

class PacketStatistics(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...

I am using 3 classes with 136 counters.

I discussed this issue with my friend Gustavo Carneiro, and he suggested using __slots__ to replace __dict__.
I converted the class as below,

class PacketStatistics(object):
    __slots__ = ('event_id', 'event_source', 'event_dest',...)
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0
        ...

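When I converted all 3 classes, the memory usage went from 3GB before to 504MB now. A saving of over 80% in memory usage!!
Below is the memory profiling after the __dict__ to __slots__ conversion.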

Partition of a set of 36157 objects. Total size = 4758960 bytes.
 Index  Count   %     Size   % Cumulative  % Kind (class / dict of class)
     0  15966  44  1304424  27   1304424  27 str
     1   7592  21   624776  13   1929200  41 tuple
     2    780   2   587424  12   2516624  53 dict (no owner)
     3     90   0   278640   6   2795264  59 dict of module
     4   2132   6   255840   5   3051104  64 types.CodeType
     5   2059   6   247080   5   3298184  69 function
     6   1715   5   245336   5   3543520  74 list
     7    225   1   232344   5   3775864  79 dict of type
     8    244   1   223952   5   3999816  84 type
     9    166   0   190096   4   4189912  88 dict of class
<101 more rows. Type e.g. '_.more' to view.>

The dict of __main__.NodeStatistics is not in the top 10 anymore.
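
For readers who have not used __slots__, here is a minimal sketch of the difference, using only the three attribute names shown above. The class names WithDict and WithSlots are made up for the comparison, and the byte counts it prints vary between Python versions, so treat them as indicative only.

```python
import sys

class WithDict(object):
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0

class WithSlots(object):
    # Fixed attribute layout: no per-instance __dict__ is created.
    __slots__ = ('event_id', 'event_source', 'event_dest')
    def __init__(self):
        self.event_id = 0
        self.event_source = 0
        self.event_dest = 0

d, s = WithDict(), WithSlots()
dict_has_dict = hasattr(d, '__dict__')
slots_has_dict = hasattr(s, '__dict__')

# Shallow per-instance cost: the object itself plus its attribute dict, if any.
dict_total = sys.getsizeof(d) + sys.getsizeof(d.__dict__)
slots_total = sys.getsizeof(s)

# The trade-off: a slotted instance rejects attributes that were not declared.
try:
    s.unknown_counter = 1
    rejected = False
except AttributeError:
    rejected = True

print(dict_has_dict, slots_has_dict, dict_total > slots_total, rejected)
```

The saving per instance is small, so it only adds up to something noticeable when very many instances (or many attributes per instance) are alive at once.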

I am satisfied with the result and glad to close this issue.

Thanks for all your guidance. Truly appreciate it.

rgds Saravanan K

score 13 · Accepted Answer

with open('datafile') as f:
    for line in f:
        process(line)

This works because files are iterators that yield one line at a time until there are no more lines to yield.
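
A small sketch of what "files are iterators" means in practice, written for Python 3 and using a throwaway temporary file (the file contents are made up for the demonstration):

```python
import os
import tempfile

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\nthird\n")

with open(tmp.name) as f:
    is_own_iterator = iter(f) is f   # a file object is its own iterator
    first = next(f)                  # reads just one line
    rest = [line for line in f]      # the loop resumes after that line

os.remove(tmp.name)
print(is_own_iterator, repr(first), rest)
```

Because each line is produced on demand, the for loop never needs the whole file in memory.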

score 4 · Answer

The fileinput module will let you read it line by line without loading the entire file into memory. See the pydocs.

import fileinput
for line in fileinput.input(['myfile']):
    do_something(line)

Code sample taken from yak.net
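
As a slightly fuller sketch, fileinput can also chain several files and report line numbers as it goes. The two temporary files below are invented for the demonstration; fileinput.lineno() and fileinput.filelineno() are the module's own helpers.

```python
import fileinput
import os
import tempfile

# Two small throwaway files standing in for real data files.
paths = []
for text in ("a1\na2\n", "b1\n"):
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
        tmp.write(text)
    paths.append(tmp.name)

seen = []
for line in fileinput.input(paths):
    # lineno() counts across all files, filelineno() restarts for each file.
    seen.append((fileinput.lineno(), fileinput.filelineno(), line.rstrip("\n")))
fileinput.close()

for path in paths:
    os.remove(path)
print(seen)
```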

score 0 · Answer

@mgilson's answer is correct. The simple solution deserves an official mention though (@HerrKaputt mentioned this in a comment)

file = open('datafile')
for line in file:
    process(line)
file.close()

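This is simple, pythonic, and understandable. If you don't understand how with works, just use this.

As the other poster mentioned, this does not create a big list the way file.readlines() does. Rather, it pulls off one line at a time, in the way that is traditional for unix files/pipes.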
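
The difference can be made visible with tracemalloc, which only exists in Python 3.4 and later (so not in the Python 2.6 used in the question). This is a rough sketch on a made-up file of about 3MB; the helper names are hypothetical and the exact byte counts will differ from machine to machine.

```python
import os
import tempfile
import tracemalloc

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.writelines("x" * 60 + "\n" for _ in range(50000))   # about 3MB

def peak_bytes(read):
    """Run read(path) and return the peak traced allocation in bytes."""
    tracemalloc.start()
    read(tmp.name)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak

def with_readlines(path):
    with open(path) as f:
        for line in f.readlines():   # builds the full list of lines first
            pass

def with_iteration(path):
    with open(path) as f:
        for line in f:               # holds one line plus a read buffer
            pass

peak_list = peak_bytes(with_readlines)
peak_iter = peak_bytes(with_iteration)
os.remove(tmp.name)
print(peak_list, peak_iter)
```

On a typical run the readlines() peak is several megabytes while the iteration peak stays far smaller, since only the current line and a buffer are alive.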

score 0 · Answer

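If the file is JSON, XML, CSV, genomics or any other well-known format, there are specialized readers that use C code directly and are far better optimized for both speed and memory than parsing in native Python. Avoid parsing it natively whenever possible.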
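
As one example of such a reader, the standard csv module (backed by a C extension in CPython) parses rows lazily as you iterate. The sketch below is for Python 3 and uses an invented three-column file; the NS2 trace in the question is not CSV, so this only illustrates the general idea.

```python
import csv
import os
import tempfile

# Hypothetical CSV data, loosely modelled on the trace fields in the question.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["event", "time", "node"])
    writer.writerow(["s", "1.231932886", "25"])
    writer.writerow(["r", "1.232776108", "42"])

received = 0
with open(tmp.name, newline="") as f:
    reader = csv.reader(f)
    header = next(reader)
    for row in reader:        # rows are parsed one at a time
        if row[0] == "r":
            received += 1

os.remove(tmp.name)
print(header, received)
```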

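But in general, some tips from my experience:

Python's multiprocessing package is fantastic for managing subprocesses; all memory leaks go away when the subprocess ends.
Run the reader subprocess as a multiprocessing.Process and use a multiprocessing.Pipe(duplex=True) to communicate (send the filename and any other arguments, then read its stdout)
Read in small (but not tiny) chunks, say 64KB-1MB. That is better for memory usage, and also for responsiveness with respect to other running processes/subprocesses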
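
The last tip, reading in chunks while never cutting a line in half, could look roughly like this. It is a sketch under assumptions: read_chunks_of_lines and the demo file are made up, the file is read in binary mode, and lines are assumed to end in "\n". The built-in readlines(hint) offers similar batching if you prefer not to write it yourself.

```python
import os
import tempfile

def read_chunks_of_lines(path, chunk_size=64 * 1024):
    """Yield lists of complete lines, reading about chunk_size bytes at a time."""
    leftover = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split(b"\n")
            # The last piece may be a partial line; carry it into the next chunk.
            leftover = lines.pop()
            if lines:
                yield lines
    if leftover:
        yield [leftover]   # final line that had no trailing newline

# Demonstration on a throwaway file with a deliberately small chunk size.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for i in range(10000):
        tmp.write(("line %d\n" % i).encode("ascii"))

all_lines = [line.decode("ascii")
             for lines in read_chunks_of_lines(tmp.name, 1024)
             for line in lines]
os.remove(tmp.name)
print(len(all_lines), all_lines[0], all_lines[-1])
```

Lines are yielded without their trailing newline, and only one chunk's worth of lines is held by the reader at a time.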
