python - Python - CSV: large file with rows of different lengths

Question

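In short, I have a 20,000,000-line CSV file whose rows have different lengths. This is due to archaic data loggers and proprietary formats. The end result we get is a CSV file in the format below. My goal is to insert this file into a Postgres database. How can I do the following:

Keep the first 8 columns and my last 2 columns, to get a consistent CSV file
Add a new column to the CSV file, in either the first or the last position.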

1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50
1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0,0,0 img_id.jpg, -50

score 8 · Accepted Answer

Read a row with csv, then:

newrow = row[:8] + row[-2:]

Then add your new field and write it out (also with csv).
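The steps above could look roughly like the following. This is only a sketch, assuming Python 3; the function name trim_rows and the constant "onemore" used for the new column are made up for illustration. skipinitialspace=True is used because the sample rows have a space after some of the commas.

```python
import csv
import io

def trim_rows(infile, outfile, new_value="onemore"):
    """Keep the first 8 and last 2 fields of each row and append one new field.

    `new_value` is a hypothetical placeholder for whatever the new column holds.
    """
    reader = csv.reader(infile, skipinitialspace=True)
    writer = csv.writer(outfile, lineterminator="\n")
    for row in reader:
        if not row:  # skip blank lines
            continue
        writer.writerow(row[:8] + row[-2:] + [new_value])

# Small in-memory demonstration with two rows of different lengths.
sample = io.StringIO(
    "1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50\n"
    "1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50\n"
)
result = io.StringIO()
trim_rows(sample, result)
print(result.getvalue())
```

For a real file you would pass file objects opened with newline='' instead of the in-memory buffers, so that the csv module handles line endings itself.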

score 2 · Accepted Answer

You can open the file as a text file and read it one line at a time. Are there quoted or escaped commas that do not "split fields"? If not, you can do

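with open('thebigfile.csv', 'r') as thecsv:
    for line in thecsv:
        # split the current line on commas and trim whitespace around each field
        fields = [f.strip() for f in line.split(',')]
        # first 8 fields + last 2 fields + the new column
        consist = fields[:8] + fields[-2:] + ['onemore']
        # ... use the `consist` list as warranted ...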

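I suspect that where I have + ['onemore'] you may want to "add a column", as you say, with some very different content, but of course I can't guess what it might be.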

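Don't send each row to the database separately with its own insert: 20 million inserts would take a long time. Instead, group the "made-consistent" lists by appending them to a temporary list; each time that list's length hits 1000, use an executemany to add all those entries.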
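A minimal sketch of that batching idea follows. The helper name insert_in_batches, the table demo and its columns are assumptions made for illustration, and an in-memory SQLite database stands in for Postgres only so the example runs without a server. With a Postgres driver such as psycopg2 the placeholder would be %s rather than ?.

```python
import sqlite3

def insert_in_batches(cursor, sql, rows, batch_size=1000):
    """Send rows to the database in groups of `batch_size` via executemany."""
    batch = []
    total = 0
    for row in rows:
        batch.append(row)
        if len(batch) >= batch_size:
            cursor.executemany(sql, batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final, partial batch
        cursor.executemany(sql, batch)
        total += len(batch)
    return total

# Demonstration against an in-memory SQLite database (a stand-in for Postgres).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE demo (a TEXT, b TEXT, extra TEXT)")
rows = ((str(i), "img_id.jpg", "onemore") for i in range(2500))
inserted = insert_in_batches(cur, "INSERT INTO demo VALUES (?, ?, ?)", rows)
conn.commit()
count = cur.execute("SELECT COUNT(*) FROM demo").fetchone()[0]
print(inserted, count)
```

Because rows is consumed lazily, only one batch is held in memory at a time, which matters for a 20-million-line input.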

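Edit: to clarify, I don't recommend using csv to process a file you know isn't in "proper" csv format: processing it directly gives you more direct control (especially as you discover other irregularities beyond the varying number of commas per line).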

score 1 · Accepted Answer

I would recommend using the csv module. Here's some code based on CSV processing I've done elsewhere

import csv
import sys

def process( reader, writer ):
    for row in reader:
        # keep the first 8 and the last 2 fields of every row
        data = row[:8] + row[-2:]
        writer.writerow( data )

def main( infilename, outfilename ):
    # newline='' lets the csv module handle line endings itself
    with open( infilename, newline='' ) as infile:
        reader = csv.reader( infile )
        with open( outfilename, 'w', newline='' ) as outfile:
            writer = csv.writer( outfile )
            process( reader, writer )

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print("syntax: python process.py filename outname")
        sys.exit(1)
    main( sys.argv[1], sys.argv[2] )

score 1 · Accepted Answer

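Sorry, you will need to write some code for this one. When you have a huge file like this, it's worth checking all of it to be sure it's consistent with what you expect. If you let bad data into your database, you will never get all of it out.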

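Remember the oddities of CSV: it's a mishmash of similar standards with different rules about quoting, escaping, null characters, unicode, empty fields (",,,"), multi-line inputs, and blank lines. The csv module has "dialects" and options, and you might find the csv.Sniffer class helpful.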
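As a small illustration of csv.Sniffer, the sketch below guesses the delimiter from a sample and then reads with the detected dialect. The sample text and the list of candidate delimiters are made up for the example. The Sniffer is a heuristic and can raise csv.Error when it cannot decide, so on a messy file its guess is a starting point rather than a guarantee.

```python
import csv
import io

sample_text = "a;b;c\n1;2;3\n4;5;6\n"

# Ask the Sniffer to guess the dialect from a sample of the file,
# restricting the candidate delimiters to a few likely ones.
dialect = csv.Sniffer().sniff(sample_text, delimiters=",;\t")
print(repr(dialect.delimiter))

# Read the same text using the detected dialect.
rows = list(csv.reader(io.StringIO(sample_text), dialect))
print(rows)
```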

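I recommend you:

Run a "tail" command to look at the last few lines.
If it appears well behaved, run the whole file through the csv reader to see if it breaks. Make a quick histogram of "fields per line".
Think about the "valid" ranges and character types, and rigorously check them as you read. Especially watch for unusual unicode or characters outside the printable range.
Seriously consider whether you want to keep the extra, odd-ball values in a "rest of the line" text field.
Toss any unexpected lines into an exception file.
Fix your code to handle the new pattern in the exception file. Rinse. Repeat.
Finally, run the whole thing again, this time actually dumping the data into the database.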
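Two of the steps above, the "fields per line" histogram and setting aside unexpected lines, can be sketched as follows. The function names and the threshold min_fields=10 (8 leading columns plus 2 trailing ones) are assumptions for illustration; in practice the bad rows would be written to the exception file rather than kept in a list.

```python
import csv
import io
from collections import Counter

def fields_per_line(lines):
    """Count how many rows have each number of fields."""
    histogram = Counter()
    for row in csv.reader(lines, skipinitialspace=True):
        histogram[len(row)] += 1
    return histogram

def split_good_and_bad(lines, min_fields=10):
    """Separate rows with at least `min_fields` fields from the rest.

    `min_fields=10` is an assumption: 8 leading columns plus 2 trailing ones.
    """
    good, bad = [], []
    for row in csv.reader(lines, skipinitialspace=True):
        (good if len(row) >= min_fields else bad).append(row)
    return good, bad

sample = (
    "1, 2, 3, 4, 5, 0,0,0,0,0,0,0,0,0, img_id.jpg, -50\n"
    "1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50\n"
    "1, 2, 3, 4, 5, 0,0,0,0,0,0, img_id.jpg, -50\n"
    "garbage line\n"
)
hist = fields_per_line(io.StringIO(sample))
good, bad = split_good_and_bad(io.StringIO(sample))
print(sorted(hist.items()))
print(len(good), len(bad))
```

A histogram with one or two dominant field counts and a long tail of rare ones is a hint about which lines deserve a closer look.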

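Your development will go faster if you don't touch a database until you are completely done. Also, be advised that SQLite is blazingly fast on read-only data, so Postgres might not be the best solution.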

Your final code will probably look something like this, but I can't be sure without knowing your data, especially how "well behaved" it is:

import string

# assumes `reader` (a csv.reader), `cursor` and `database` are set up elsewhere
eof = False
while not eof:
    out = []
    for chunk in range(1000):
        try:
            fields = next(reader)
        except StopIteration:
            eof = True
            break
        except Exception:
            print(str(reader.line_num) + ", 'failed to parse'")
            continue  # nothing usable on this line
        try:
            assert len(fields) > 5 and len(fields) < 12
            assert int(fields[3]) > 0 and int(fields[3]) < 999999
            assert int(fields[4]) >= 1 and int(fields[4]) <= 12  # date
            assert fields[5] == fields[5].strip()  # no extra whitespace
            assert not fields[5].strip(string.printable)  # no odd chars
            ...
        except (AssertionError, ValueError):
            print(str(reader.line_num) + ", 'failed checks'")
            continue  # skip rows that fail validation
        new_rec = [reader.line_num]  # new first item
        new_rec.extend(fields[:8])   # first eight
        new_rec.extend(fields[-2:])  # last two
        new_rec.append(",".join(fields[8:-2]))  # and the rest
        out.append(new_rec)
    if database:
        cursor.executemany("INSERT INTO raw_table VALUES %d,...", out)

Of course, your mileage may vary with this code. It's a first draft of pseudo-code. Expect writing solid code for the input to take most of a day.
