python - Python csv distorts tell()

Question

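I'm trying to find out what percentage of a csv file I have read so far. I know how to do this using tell() on a file object, but when I read that file object with csv.reader and then run a for loop over the rows in my reader object, tell() always returns as if it were at the end of the file, no matter where I am in the loop. How can I find out where I am?

Current code: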

import csv
import os

with open(FILE_PERSON, 'rb') as csvfile:
    spamreader = csv.reader(csvfile)
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "|", justtesting

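I threw "justtesting" in there just to prove that tell() really does return 0 before I start my for loop.

This prints the same thing for every row in my csv file: 579 of 579 | 0

What am I doing wrong?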

score 3 · Accepted Answer

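The csv library uses a buffer when reading the file, so the file pointer jumps ahead in larger blocks. It does not read your file line by line.

It reads data in larger chunks to make parsing easier, and because newlines can be embedded inside quoted values, reading CSV data strictly line by line would not work.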
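To illustrate the embedded-newline point, here is a minimal Python 3 sketch. The sample data is made up for illustration; it shows that the number of physical lines can differ from the number of CSV records.

```python
import csv
import io

# Hypothetical sample: the second record has a newline inside a quoted field.
data = 'id,comment\r\n1,"first line\nsecond line"\r\n2,plain\r\n'

# Counting newline characters counts physical lines, not records.
physical_lines = data.count("\n")

# csv.reader joins the two physical lines of the quoted field into one record.
records = list(csv.reader(io.StringIO(data, newline="")))

print(physical_lines, "physical lines,", len(records), "records")
```

A naive line count would therefore overestimate the total for this input.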

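If you have to give a progress report, then you need to count the lines up front. The following only works if your input CSV file does not embed newlines in column values: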

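import csv

with open(FILE_PERSON, 'rb') as csvfile:
    # First pass: count physical lines (Python 2; assumes no embedded newlines).
    linecount = sum(1 for _ in csvfile)
    csvfile.seek(0)  # rewind before parsing
    spamreader = csv.reader(csvfile)
    # Start counting at 1 so the last row reports "N of N".
    for line, row in enumerate(spamreader, 1):
        print '{} of {}'.format(line, linecount)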

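There are other ways to count the number of lines (see "How to get line count cheaply in Python?"), but since you will be reading the file anyway to process it as CSV, you may as well use the file you already have open. I'm not sure that opening the file as a memory map, and then reading it again as a normal file, would perform any better.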
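If the file may contain newlines inside quoted values, one option is to let csv.reader do the counting in a first pass, at the cost of parsing the file twice. The following is a minimal sketch under the assumption of Python 3 and a UTF-8 encoded file; the function names count_records and rows_with_counts are hypothetical, not part of the csv module.

```python
import csv


def count_records(path, encoding="utf-8"):
    """Count CSV records (not physical lines) by parsing the file once."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, newline="", encoding=encoding) as f:
        return sum(1 for _ in csv.reader(f))


def rows_with_counts(path, encoding="utf-8"):
    """Yield (index, total, row), where index starts at 1."""
    total = count_records(path, encoding)
    with open(path, newline="", encoding=encoding) as f:
        for index, row in enumerate(csv.reader(f), start=1):
            yield index, total, row
```

A caller could loop over rows_with_counts(path) and print index and total for each row. The trade-off is that the whole file is parsed twice, which may matter for very large inputs.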

score 0 · Accepted Answer

The csv.reader documentation says:

... csvfile can be any object which supports the iterator protocol and returns a string each time its next() method is called ...
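As a small illustration of that sentence, the sketch below passes a plain list of strings to csv.reader instead of a file object. It assumes Python 3 (where the iterator method is named __next__ rather than next()), and the sample rows are made up.

```python
import csv

# Hypothetical sample rows; any iterable that yields one string per line works.
lines = ["name,age", "Ada,36", "Linus,28"]

rows = list(csv.reader(lines))
print(rows)
```

This is what makes it possible to put a custom generator between the file and the reader.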

So, with a small change to the OP's original code:

import csv
import os
filename = "tar.data"
with open(filename, 'rb') as csvfile:
    spamreader = csv.reader(csvfile)
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "|", justtesting
###############################################
def generator(csvfile):
    # readline seems to be the key
    while True:
        line = csvfile.readline()
        if not line:
            break
        yield line
###############################################
print
with open(filename, 'rb', 0) as csvfile:
    spamreader = csv.reader(generator(csvfile))
    justtesting = csvfile.tell()
    size = os.fstat(csvfile.fileno()).st_size
    for row in spamreader:
        pos = csvfile.tell()
        print pos, "of", size, "-", justtesting

Running it against my test data gives the following, showing that the two approaches produce different results.

224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0
224 of 224 | 0

16 of 224 - 0
32 of 224 - 0
48 of 224 - 0
64 of 224 - 0
80 of 224 - 0
96 of 224 - 0
112 of 224 - 0
128 of 224 - 0
144 of 224 - 0
160 of 224 - 0
176 of 224 - 0
192 of 224 - 0
208 of 224 - 0
224 of 224 - 0

I set zero buffering on the open() call, but it made no difference; what matters is the readline in the generator.
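The code above is Python 2. A rough Python 3 adaptation of the same readline idea might look like the sketch below. It assumes a UTF-8 encoded file, and the name rows_with_progress is hypothetical. The file is opened in binary mode because, in Python 3, tell() on a text-mode file is disabled while the file is being iterated.

```python
import csv
import os


def rows_with_progress(path, encoding="utf-8"):
    """Yield (row, bytes_read, total_bytes) for each CSV record."""
    size = os.path.getsize(path)
    with open(path, "rb") as raw:

        def lines():
            # Pull one physical line at a time so tell() tracks what the
            # csv reader has actually consumed.
            while True:
                line = raw.readline()
                if not line:
                    break
                yield line.decode(encoding)

        for row in csv.reader(lines()):
            yield row, raw.tell(), size
```

Each yielded position is the byte offset just past the last physical line that the reader needed for that record, so a record with an embedded newline advances the position by more than one line.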
