python - Concatenating and sorting thousands of CSV files

Question

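I have thousands of CSV files on disk. Each one is about 10 MB (~10K columns). Most of these columns contain real (float) values.

I would like to create a DataFrame by concatenating these files. Once I have that DataFrame, I would like to sort its rows by the first two columns.

This is what I have at the moment: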

import numpy as np
import pandas as pd

# p_files (list of input paths) and p_output (output path) are defined elsewhere.
my_dfs = []
for file in p_files:
    my_dfs.append(
        pd.read_csv(file, sep=':', dtype={'c1': np.object_, 'c2': np.object_}))

print("Concatenating files ...")
df_merged = pd.concat(my_dfs)

print("Sorting the result by the first two columns...")
# DataFrame.sort has been removed from pandas; sort_values is its replacement.
df_merged = df_merged.sort_values(['videoID', 'frameID'], ascending=[True, True])

print("Saving it to disk ..")
df_merged.to_csv(p_output, sep=':', index=False)

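But this requires so much memory that my process is killed before it produces a result (in the logs I can see that the process was killed when it was using about 10 GB of memory).

I am trying to figure out where exactly it fails, but I have not managed to yet (although I hope to be logging stdout soon).

Is there a better way to do this in Pandas?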

score 4 · Accepted Answer

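Loading them into a database is easy, leaves you free to make changes later, and benefits from all the optimization work that has gone into databases. Once the data is loaded, if you want an iterable over it, you can run the following query and be done: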

SELECT * FROM my_table ORDER BY column1, column2
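As a small illustration of "getting an iterable", here is a sketch using Python's built-in sqlite3 module with an in-memory database. The table layout and the sample rows are made up for the example; only the ORDER BY query comes from the answer.

```python
import sqlite3

# In-memory database with a hypothetical three-column table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (column1 TEXT, column2 TEXT, value REAL)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [("b", "2", 0.5), ("a", "2", 1.5), ("a", "1", 2.5)],
)

# execute() returns a cursor, which is an iterator over the result rows.
rows = conn.execute("SELECT * FROM my_table ORDER BY column1, column2")
sorted_rows = list(rows)  # materialized here only because the sample is tiny
conn.close()
```

With real data you would loop over the cursor instead of calling list() on it, so that rows are fetched as they are consumed.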

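I'm fairly sure there are more direct ways to load the data from within sqlite3 itself, but if you don't want to do it directly in SQLite, you can load the data with Python, using a csv reader as an iterator so that only a minimal amount is held in memory at a time, for example: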

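import csv
import sqlite3

# dbpath and paths are defined elsewhere; mytable must already exist with three columns.
conn = sqlite3.connect(dbpath)
c = conn.cursor()

for path in paths:
    with open(path, newline='') as f:
        # csv.reader yields one row at a time, so each file is streamed rather than loaded whole.
        reader = csv.reader(f)
        c.executemany("INSERT INTO mytable VALUES (?,?,?)", reader)

conn.commit()  # persist the inserted rows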

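That way you never have to load much into memory, and you get to take advantage of SQLite.

Afterwards (if you want to do this in Python again), you can do the following: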
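The loading loop above assumes the table already exists and that the files have exactly three columns and no header row. A minimal sketch of a more general loader follows, assuming every file starts with the same header line and uses ':' as the separator (as in the question). The function name load_csvs and its parameters are hypothetical, and every column is stored as TEXT for simplicity.

```python
import csv
import sqlite3


def load_csvs(conn, paths, table="mytable", delimiter=":"):
    """Stream each CSV in `paths` into `table`, creating it from the first header."""
    insert_sql = None
    for path in paths:
        with open(path, newline="") as f:
            reader = csv.reader(f, delimiter=delimiter)
            header = next(reader, None)  # consume the header row
            if header is None:
                continue  # skip empty files
            if insert_sql is None:
                # Quote column names so unusual headers remain valid identifiers.
                cols = ", ".join(
                    '"{}" TEXT'.format(name.replace('"', '""')) for name in header
                )
                conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, cols))
                placeholders = ", ".join("?" for _ in header)
                insert_sql = 'INSERT INTO "{}" VALUES ({})'.format(table, placeholders)
            # The reader is consumed lazily, one row at a time.
            conn.executemany(insert_sql, reader)
    conn.commit()
```

A call such as load_csvs(sqlite3.connect(dbpath), paths) would then replace the explicit loop. Because the values are stored as text, ORDER BY compares them as strings, which matches reading the two key columns as objects in the question.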

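import csv
import sqlite3

conn = sqlite3.connect(dbpath)
c = conn.cursor()

# Open the output for writing; newline='' lets the csv module control line endings.
with open(outpath, 'w', newline='') as f:
    writer = csv.writer(f)
    # The cursor is consumed row by row, so the sorted result is streamed to disk.
    writer.writerows(c.execute("SELECT * FROM mytable ORDER BY col1, col2"))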
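One caveat for the files in the question: SQLite caps the number of columns per table (2000 by default), so a table with ~10K columns may not be creatable. A possible workaround, sketched below, is to store only the two sort keys plus the raw text of each line, then write the lines back in sorted order. This is a variation on the answer, not the answer's own method. It assumes ':' separated files with a shared header line, at least two columns per row, and no quoted fields containing the separator or line breaks. The function name sort_wide_csvs and the table name lines are hypothetical.

```python
import sqlite3


def _terminated(line):
    # Make sure a line ends with a newline (a file's last line may lack one).
    return line if line.endswith("\n") else line + "\n"


def sort_wide_csvs(paths, outpath, dbpath, delimiter=":"):
    """Sort the rows of many wide CSV files by their first two columns."""
    conn = sqlite3.connect(dbpath)
    conn.execute("CREATE TABLE IF NOT EXISTS lines (k1 TEXT, k2 TEXT, line TEXT)")
    header = None
    for path in paths:
        with open(path, newline="") as f:
            first = f.readline()  # header line of this file
            if header is None and first:
                header = _terminated(first)
            # Split off only the first two fields; the rest of the line stays as one string.
            conn.executemany(
                "INSERT INTO lines VALUES (?, ?, ?)",
                (
                    tuple(line.split(delimiter, 2)[:2]) + (_terminated(line),)
                    for line in f
                    if line.strip()
                ),
            )
    conn.commit()
    with open(outpath, "w", newline="") as out:
        if header is not None:
            out.write(header)
        # Rows come back one at a time, already ordered by the two keys.
        for (line,) in conn.execute("SELECT line FROM lines ORDER BY k1, k2"):
            out.write(line)
    conn.close()
```

Passing a file path as dbpath keeps the data on disk rather than in memory. The keys are compared as text, so numeric IDs would need a cast in the ORDER BY clause (for example CAST(k2 AS INTEGER)) to sort numerically.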
