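I want Python to use the csv reader/writer to take all the CSV files in a directory and combine them, filtering out any row whose second-column value duplicates that of any other row.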

Here is my script, which doesn't work:

import csv
import glob

with open('merged.csv','a') as out:
    seen = set()
    output = []
    out_writer = csv.writer(out)
    csv_files = [f for f in glob.glob('*.csv') if 'merged' not in f]
#csv_files = glob.glob('*.csv') 
     # I'd like to use all files including the output so that I don't
     # have to rename it when reusing the script - it should dupe-filter itself!
for filename in csv_files:
    with open(filename, 'rb') as ifile:
        read = csv.reader(ifile, delimiter=',')
        for row in read:
            if row[1] not in seen:
                seen.add(row[1])
                if row: #was getting extra rows
                    output.append(row)
out_writer.writerows(output)

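I feel like I must be missing something simple. My files are about 100MB each, and I eventually want to automate this so that different computers can share one merged file for duplicate checking.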

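For extra credit, how would I change this to check for rows that have both row[1] and row[2] in common? (Once the dupe filter and the self-inclusion work, of course...)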

2 Answers

I'd suggest using pandas instead of the csv writer. I would rewrite your code as follows:

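import pandas as pd
import glob

# COLNAME_LIST is a placeholder: define it as the list of column labels
# to compare when looking for duplicates.
data = pd.concat([pd.read_csv(file) for
                  file in glob.glob("*.csv")]).drop_duplicates(subset=COLNAME_LIST)
# index=False keeps the row index out of the merged file
data.to_csv('merged.csv', index=False)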

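Full disclosure: I haven't tested this code, since I don't have a bunch of CSV files on hand, but I have successfully written similar things before.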

Answered 2013-10-31T21:37:29.283

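This takes more than the few lines pandas might need, because it's plain Python, but on the other hand it's relatively simple, filters on multiple column values, and handles re-reading a previous result. It uses the fileinput module so that its multiple input files can be treated as a single continuous stream of data lines.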

import csv
import fileinput
import glob
import os

merged_csv = 'merged.csv'
columns = (1, 2)  # columns used for filtering
pathname = '*.csv'
tmpext = os.extsep + "tmp"
csv_files = glob.glob(pathname)

if merged_csv not in csv_files:
    prev_merged = None
else:
    prev_merged = merged_csv + tmpext
    os.rename(merged_csv, prev_merged)
    csv_files[csv_files.index(merged_csv)] = prev_merged

with open(merged_csv, 'wb') as ofile:
    csv_writer = csv.writer(ofile)
    written = set()  # unique combinations of column values written
    csv_stream = fileinput.input(csv_files, mode='rb')
    for row in csv.reader(csv_stream, delimiter=','):
        combination = tuple(row[col] for col in columns)
        if combination not in written:
            csv_writer.writerow(row)
            written.add(combination)

if prev_merged:
    os.unlink(prev_merged)  # clean up

print '{!r} file {}written'.format(merged_csv, 're' if prev_merged else '')
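
For anyone on Python 3, here is a rough sketch of the same idea, not a tested drop-in replacement. The script above is Python 2 (print statement, binary-mode files for csv). The function name merge_unique and the helper open_for_csv are hypothetical names of mine. Reading the previous merged file first (so existing rows win over new duplicates) and skipping rows too short to contain the filter columns are also my assumptions, not part of the original script.

```python
import csv
import fileinput
import glob
import os


def open_for_csv(filename, mode):
    # Hypothetical helper: in Python 3 the csv module wants text-mode
    # files opened with newline=''.
    return open(filename, mode, newline='')


def merge_unique(pattern, merged_csv, columns=(1, 2)):
    """Merge the CSV files matching pattern into merged_csv, keeping only
    the first row seen for each combination of values in columns.
    Returns the number of rows written."""
    csv_files = sorted(glob.glob(pattern))
    prev_merged = None
    if merged_csv in csv_files:
        # Move the old result aside so it can be read while the new one
        # is being written.
        prev_merged = merged_csv + os.extsep + 'tmp'
        os.replace(merged_csv, prev_merged)
        csv_files.remove(merged_csv)
        # Assumption: read the old result first so its rows take precedence.
        csv_files.insert(0, prev_merged)

    written = set()  # unique combinations of column values written
    with open(merged_csv, 'w', newline='') as ofile:
        csv_writer = csv.writer(ofile)
        if csv_files:  # an empty list would make fileinput read stdin
            with fileinput.input(csv_files, openhook=open_for_csv) as stream:
                for row in csv.reader(stream):
                    if len(row) <= max(columns):
                        continue  # skip blank or too-short rows
                    combination = tuple(row[col] for col in columns)
                    if combination not in written:
                        written.add(combination)
                        csv_writer.writerow(row)

    if prev_merged:
        os.unlink(prev_merged)  # clean up
    return len(written)
```

Calling merge_unique('*.csv', 'merged.csv') from the data directory should rebuild merged.csv from every CSV there, including a previous merged.csv if one exists. Pass columns=(1,) to filter on the second column only.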
Answered 2013-10-31T23:44:45.360