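I have a very large CSV file (5 GB), so I don't want to load the whole thing into memory, and I want to drop one or more of its columns. I tried the following code with blaze, but all it did was append the resulting columns to the existing CSV file: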
from blaze import Data, odo
d = Data("myfile.csv")
d = d[columns_I_want_to_keep]
odo(d, "myfile.csv")
Is there a way, using either pandas or blaze, to keep only the columns I want and delete the others?
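You can use dask.dataframe, which is syntactically similar to pandas but operates out of core, so memory shouldn't be an issue. It also parallelizes the process automatically, so it should be fast.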
import dask.dataframe as dd
df = dd.read_csv('myfile.csv', usecols=['col1', 'col2', 'col3'])
df.to_csv('output.csv', index=False)
Timings
I've timed each of the methods posted so far on a 1.4 GB CSV file. I kept four columns, which left the output CSV file at 250 MB.
Using dask:
%%timeit
df = dd.read_csv(f_in, usecols=cols_to_keep)
df.to_csv(f_out, index=False)
1 loop, best of 3: 41.8 s per loop
Using pandas:
%%timeit
chunksize = 10**5
for chunk in pd.read_csv(f_in, chunksize=chunksize, usecols=cols_to_keep):
    chunk.to_csv(f_out, mode='a', index=False)
1 loop, best of 3: 44.2 s per loop
Using Python/csv:
%%timeit
inc_f = open(f_in, 'r')
csv_r = csv.reader(inc_f)
out_f = open(f_out, 'w')
csv_w = csv.writer(out_f, delimiter=',', lineterminator='\n')
for row in csv_r:
    new_row = [row[1], row[5], row[6], row[8]]
    csv_w.writerow(new_row)
inc_f.close()
out_f.close()
1 loop, best of 3: 1min 1s per loop
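The csv version above picks columns by hard-coded position. As a minimal sketch, assuming the file has a header row, the same streaming approach can select columns by name with csv.DictReader and csv.DictWriter. The function name keep_columns is hypothetical and was not part of the timed code, so the timings above do not apply to it.

```python
import csv

def keep_columns(f_in, f_out, cols_to_keep):
    """Stream f_in row by row and write only cols_to_keep to f_out."""
    # newline='' lets the csv module manage line endings itself
    with open(f_in, 'r', newline='') as src, open(f_out, 'w', newline='') as dst:
        reader = csv.DictReader(src)  # uses the first row as column names
        # extrasaction='ignore' drops any key not listed in fieldnames
        writer = csv.DictWriter(dst, fieldnames=cols_to_keep,
                                extrasaction='ignore', lineterminator='\n')
        writer.writeheader()
        for row in reader:
            writer.writerow(row)
```

Only one row is held in memory at a time, so the file size should not matter. Building a dict per row is likely slower than indexing a list, so this trades some speed for readability.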
I would do it this way:
import pandas as pd
cols2keep = ['col1','col3','col4','col6'] # columns you want to have in the resulting CSV file
chunksize = 10**5 # you may want to adjust it ...
for chunk in pd.read_csv(filename, chunksize=chunksize, usecols=cols2keep):
    chunk.to_csv('output.csv', mode='a', index=False)
PS: if it suits your use case, you may also want to consider migrating from CSV to PyTables (HDF5)...
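Reading the original CSV in chunks and appending each chunk to a new file writes the header every time a chunk is saved to disk. This can be avoided as follows: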
cols_to_keep = ['col1', 'col2'] # or [0, 1]
add_header = True
chunksize = 10**5
for chunk in pd.read_csv(f_in, chunksize=chunksize, usecols=cols_to_keep):
    chunk.to_csv(f_out, mode='a', index=False, header=add_header)
    if add_header:
        # The header should not be printed more than once
        add_header = False
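A variant of the loop above, offered only as a sketch: opening the output file once in write mode and passing the handle to to_csv avoids reopening the file for every chunk, and it also means a leftover output file from an earlier run is overwritten rather than appended to. The function name write_selected_columns is hypothetical.

```python
import pandas as pd

def write_selected_columns(f_in, f_out, cols_to_keep, chunksize=10**5):
    """Copy only cols_to_keep from f_in to f_out, one chunk at a time."""
    # Mode 'w' truncates any previous output, so reruns do not append
    # to stale data; newline='' leaves line endings to pandas.
    with open(f_out, 'w', newline='') as out:
        reader = pd.read_csv(f_in, chunksize=chunksize, usecols=cols_to_keep)
        for i, chunk in enumerate(reader):
            # Only the first chunk writes the header row
            chunk.to_csv(out, index=False, header=(i == 0))
```

Note that usecols selects columns but keeps them in the order they appear in the input file, not the order of the list.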
I work with large CSV files a lot. Here is my solution:
import csv
fname_in = r'C:\mydir\myfile_in.csv'
fname_out = r'C:\mydir\myfile_out.csv'
# newline='' lets the csv module handle line endings itself (Python 3);
# the with block closes both files even if an error occurs
with open(fname_in, 'r', newline='') as inc_f, \
     open(fname_out, 'w', newline='') as out_f:
    csv_r = csv.reader(inc_f) # attach the csv "lens" to the input stream - default is excel dialect
    csv_w = csv.writer(out_f, delimiter=',', lineterminator='\n') # attach the csv "lens" to the stream headed to the output file
    for row in csv_r: # loop through each row in the input file
        new_row = row[:] # initialize the output row
        new_row.pop(5) # whatever column you wanted to delete
        csv_w.writerow(new_row)
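The loop above removes a single column by position. Since the question asks about one or more columns, here is a hedged sketch that resolves column names to positions from the header row and then drops those positions from every row. It assumes the file is non-empty, the first row is a header, and every row has as many fields as the header; drop_columns is a hypothetical helper name.

```python
import csv

def drop_columns(fname_in, fname_out, cols_to_drop):
    """Stream fname_in and write it to fname_out without cols_to_drop."""
    with open(fname_in, 'r', newline='') as inc_f, \
         open(fname_out, 'w', newline='') as out_f:
        csv_r = csv.reader(inc_f)
        csv_w = csv.writer(out_f, delimiter=',', lineterminator='\n')
        header = next(csv_r)  # assumes the first row holds column names
        drop = set(cols_to_drop)
        # Positions of the columns that survive, in their original order
        keep_idx = [i for i, name in enumerate(header) if name not in drop]
        csv_w.writerow([header[i] for i in keep_idx])
        for row in csv_r:
            csv_w.writerow([row[i] for i in keep_idx])
```

Computing the kept positions once up front avoids copying and popping from each row inside the loop.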