python - Python stops while iterating over my 1 GB csv file

Question

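I have two files:

metadata.csv: contains an ID, followed by the vendor name, a file name, etc.
hashes.csv: contains an ID, followed by a hash.
The ID is essentially a foreign key that relates a file's metadata to its hash.

I wrote this script to quickly extract all hashes associated with a particular vendor. It stops before it finishes processing hashes.csv.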

stored_ids = []

# this file is about 1 MB
entries = csv.reader(open(options.entries, "rb"))

for row in entries:
  # row[2] is the vendor
  if row[2] == options.vendor:
    # row[0] is the ID
    stored_ids.append(row[0])

# this file is 1 GB
hashes = open(options.hashes, "rb")

# I iteratively read the file here,
# just in case the csv module doesn't do this.
for line in hashes:

  # not sure if stored_ids contains strings or ints here...
  # this probably isn't the problem though
  if line.split(",")[0] in stored_ids:

    # if its one of the IDs we're looking for, print the file and hash to STDOUT
    print "%s,%s" % (line.split(",")[2], line.split(",")[4])

hashes.close()

The script gets through about 2000 entries of hashes.csv before it stops. What am I doing wrong? I thought I was processing the file line by line.

P.S. The csv files are in the popular HashKeeper format, and the files I am parsing are the NSRL hash sets: http://www.nsrl.nist.gov/Downloads.htm#converter

Update: working solution below. Thanks to everyone who commented!

entries = csv.reader(open(options.entries, "rb"))   
stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

hashes = csv.reader(open(options.hashes, "rb"))
matches = dict((row[2], row[4]) for row in hashes if row[0] in stored_ids)

for k, v in matches.iteritems():
    print "%s,%s" % (k, v)

score 3 · Accepted Answer

"It stops" is not a particularly precise description. What does it actually do? Does it swap? Fill all memory? Or just eat CPU without doing anything?

As a start, though, use a dict rather than a list for stored_ids. A lookup in a dict usually takes O(1) time, whereas a lookup in a list takes O(n).

Edit: here is a trivial micro-benchmark:

$ python -m timeit -s "l=range(1000000)" "1000001 in l"
10 loops, best of 3: 71.1 msec per loop
$ python -m timeit -s "s=set(range(1000000))" "1000001 in s"
10000000 loops, best of 3: 0.174 usec per loop

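As you can see, a set (which has the same performance characteristics as a dict) searches among 1 million integers roughly 400,000 times faster than the equivalent list (well under 1 microsecond versus about 71 milliseconds per lookup). Consider that such a lookup is performed for every line of the 1 GB file, and you can see the scale of the problem.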
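The advice above can be sketched as a streaming filter. This is a minimal sketch, not the asker's script: it is written for Python 3 (the code in the question is Python 2), the function name `matching_hashes` and the sample rows are made up for illustration, and the column positions (ID in column 0, vendor in column 2 of the metadata, file name and hash in columns 2 and 4 of the hashes) are taken from the question's code.

```python
import csv
import io

def matching_hashes(metadata_file, hashes_file, vendor):
    """Yield (file_name, hash) pairs for every hash row whose ID belongs to `vendor`."""
    # Collect the vendor's IDs in a set so that each later lookup is O(1) on average.
    wanted_ids = set(row[0] for row in csv.reader(metadata_file)
                     if len(row) > 2 and row[2] == vendor)
    # Stream the large file row by row; nothing but the current row is kept in memory.
    for row in csv.reader(hashes_file):
        if len(row) > 4 and row[0] in wanted_ids:
            yield row[2], row[4]

# Made-up sample data standing in for metadata.csv and hashes.csv.
metadata = io.StringIO("1,x,Acme\n2,x,Other\n3,x,Acme\n")
hashes = io.StringIO("1,a,setup.exe,b,HASH1\n2,a,tool.exe,b,HASH2\n3,a,readme.txt,b,HASH3\n")

results = list(matching_hashes(metadata, hashes, "Acme"))
for file_name, file_hash in results:
    print("%s,%s" % (file_name, file_hash))
```

With real files, the two `io.StringIO` objects would be replaced by files opened with `open(path, newline="")`, which is the form the Python 3 csv documentation recommends.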

score 0 · Accepted Answer

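This code will die on any line that does not contain at least 4 commas; for example, it will die on an empty line. If you are sure you do not want to use the csv reader, then at least catch the IndexError raised by line.split(',')[4].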
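A minimal sketch of that suggestion, written for Python 3 and assuming the field positions used in the question (file name in field 2, hash in field 4). The helper name `extract_name_and_hash` and the sample lines are hypothetical. Note that a plain `split(",")` does not understand quoted fields that contain commas, which the csv reader does.

```python
def extract_name_and_hash(line):
    """Return (file_name, hash) from one raw line, or None if the line is too short."""
    fields = line.rstrip("\r\n").split(",")
    try:
        return fields[2], fields[4]
    except IndexError:
        # Fewer than five fields, for example an empty or truncated line.
        return None

# Made-up input: one complete line, one empty line, one truncated line.
sample_lines = ["1,a,setup.exe,b,HASH1\n", "\n", "2,a,tool.exe\n"]
parsed = [extract_name_and_hash(line) for line in sample_lines]
print(parsed)
```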

score 0 · Accepted Answer

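Please explain what you mean by "stops". Does it hang or does it exit? Is there any error traceback?

a) It will fail on any line with too few commas, for example one with no "," at all: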

>>> 'hmmm'.split(",")[2]
Traceback (most recent call last):
  File "<string>", line 1, in <string>
IndexError: list index out of range

b) Why split the line several times, instead of doing this:

tokens = line.split(",")

if len(tokens) >=5 and tokens[0] in stored_ids:
    print "%s,%s" % (tokens[2], tokens[4])

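c) Make stored_ids a dict, so that the test tokens[0] in stored_ids is fast.

d) Wrap your inner code in try/except and see whether any errors occur.

e) Where are you running it: on the command line or in some IDE?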
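Point d) could look roughly like the following. This is only a sketch in Python 3 under stated assumptions: `process_with_diagnostics` and `print_name_and_hash` are hypothetical names, and the two input lines are invented. The idea is to report which input line failed and with what exception, rather than letting the script end without explanation.

```python
import sys
import traceback

def process_with_diagnostics(lines, handle_line):
    """Call handle_line on each line and report failures with their line number.

    Returns the list of (line_number, exception) pairs that were caught.
    """
    failures = []
    for line_number, line in enumerate(lines, 1):
        try:
            handle_line(line)
        except Exception as exc:
            # Show which input line failed and why, then keep going.
            sys.stderr.write("line %d failed: %r\n" % (line_number, line))
            traceback.print_exc()
            failures.append((line_number, exc))
    return failures

def print_name_and_hash(line):
    # Same field positions as in the question: file name in 2, hash in 4.
    tokens = line.split(",")
    print("%s,%s" % (tokens[2], tokens[4].strip()))

failures = process_with_diagnostics(["1,a,setup.exe,b,HASH1\n", "hmmm\n"],
                                    print_name_and_hash)
```

Catching the broad `Exception` is a deliberate choice for debugging only; once the cause is known, a narrower handler is preferable.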

score 0 · Accepted Answer

Searching in a list takes O(n), so use a dict instead:

stored_ids = dict((row[0],1) for row in entries if row[2] == options.vendor)

Or use sets:

a=set(row[0] for row in entries if row[2] == options.vendor)
b=set(line.split(",")[0] for line in hashes)
c=a.intersection(b)

c will then contain only the ID strings that occur in both the hashes file and the metadata csv.
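Because the intersection holds only IDs, the file names and hashes still have to be read from the matching rows afterwards. A small Python 3 sketch with made-up values illustrates this; the variable `hash_rows` stands in for the lines of hashes.csv and is an assumption, as is the `by_id` lookup table.

```python
# Made-up vendor IDs and hash rows; the real values come from the two csv files.
a = {"1", "3", "7"}
hash_rows = ["1,a,setup.exe,b,HASH1",
             "2,a,tool.exe,b,HASH2",
             "3,a,readme.txt,b,HASH3"]

b = set(line.split(",")[0] for line in hash_rows)
c = a.intersection(b)  # only the IDs present on both sides

# A second pass over the rows recovers the file name and hash for each common ID.
by_id = {}
for line in hash_rows:
    tokens = line.split(",")
    if tokens[0] in c:
        by_id[tokens[0]] = (tokens[2], tokens[4])

for row_id in sorted(by_id):
    print("%s,%s" % by_id[row_id])
```

Note that `b` holds every ID in the hashes file, so for a 1 GB input this approach uses far more memory than testing each row against `a` directly.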
