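I have 100 CSV files, all containing similar information from different time periods. I only need to extract certain information from each time period, and I don't need to keep all of the data in memory.
Right now I'm using something that looks like: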
import pandas as pd
import numpy as np
import glob
average_distance = []
for files in glob.glob("*2013-Jan*"):  # Here I'm only looking at one file
    data = pd.read_csv(files)
    average_distance.append(np.mean(data['DISTANCE']))
    rows = data[np.logical_or(data['CANCELLED'] == 1, data['DEP_DEL15'] == 1)]
    del data
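Since only three columns are actually used, one option might be to have `read_csv` parse just those columns via `usecols`, which should lower the peak memory needed per file. This is a minimal sketch, assuming every file has `DISTANCE`, `CANCELLED` and `DEP_DEL15` columns; the helper name `summarize_file` and the constant `NEEDED_COLUMNS` are hypothetical:

```python
import pandas as pd

# Assumption: these three columns exist in every CSV file.
NEEDED_COLUMNS = ["DISTANCE", "CANCELLED", "DEP_DEL15"]


def summarize_file(path):
    """Return (mean distance, cancelled-or-delayed rows) for one CSV file."""
    # usecols tells pandas to parse only the listed columns.
    data = pd.read_csv(path, usecols=NEEDED_COLUMNS)
    mean_distance = data["DISTANCE"].mean()
    # Keep rows where the flight was cancelled or departed 15+ minutes late.
    rows = data[(data["CANCELLED"] == 1) | (data["DEP_DEL15"] == 1)]
    return mean_distance, rows
```

The filtered `rows` here only carry the three selected columns, so any other column needed later would have to be added to `NEEDED_COLUMNS`.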
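My question is: is there a way to do this with a generator, and if so, would it speed up the process enough to let me get through all 100 CSV files easily?
I think this may be on the right track: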
def extract_info():
    average_distance = []
    for files in glob.glob("*20*"):
        data = pd.read_csv(files)
        average_distance.append(np.mean(data['DISTANCE']))
        rows = data[np.logical_or(data['CANCELLED'] == 1, data['DEP_DEL15'] == 1)]
        yield rows

cancelled_or_delayed = [month for month in extract_info()]
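One thing the attempt above leaves open is that `average_distance` is local to the generator and is never handed back to the caller. A possible variant, sketched under the same column assumptions, yields the mean together with the filtered rows for each file. The names `iter_file_summaries`, `collect_summaries` and the `pattern` parameter are hypothetical:

```python
import glob

import pandas as pd


def iter_file_summaries(pattern):
    """Yield (path, mean distance, cancelled-or-delayed rows), one file at a time."""
    for path in sorted(glob.glob(pattern)):
        data = pd.read_csv(path)
        rows = data[(data["CANCELLED"] == 1) | (data["DEP_DEL15"] == 1)]
        # Hand both results to the caller instead of keeping a list in here.
        yield path, data["DISTANCE"].mean(), rows


def collect_summaries(pattern):
    """Consume the generator and gather the per-file results into two lists."""
    average_distance = []
    cancelled_or_delayed = []
    for _path, mean_distance, rows in iter_file_summaries(pattern):
        average_distance.append(mean_distance)
        cancelled_or_delayed.append(rows)
    return average_distance, cancelled_or_delayed
```

With this shape, `collect_summaries("*20*")` would return both lists. Each file is still read in full by `read_csv`, so I would expect the generator to change how the results are consumed rather than how long the reading takes.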
Thanks!