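I'm trying to process a large (2 GB) CSV file on a machine with only 4 GB of RAM (don't ask) to produce a different, formatted CSV containing a subset of the data that needs some processing. I read the file and create an HDFStore, which I later query for the data I need for output. Everything works except that I can't retrieve data from the store using Term: the error message says PLOT is not a column name. The individual variables look fine and the store is what I expect; I just can't see where the error is. (NB: pandas v14 and numpy 1.9.0.) I'm very new to this, so apologies for the clumsy code.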
#wibble wobble -*- coding: utf-8 -*-
# short version
def filesport():
    import pandas as pd
    import numpy as np
    from pandas.io.pytables import Term

    Location = r"CL_short.csv"
    store = pd.HDFStore('blarg.h5')
    maxlines = sum(1 for line in open(Location))
    print(maxlines)
    # set chunk small for test file
    chunky = 4
    plotty = pd.DataFrame(columns=['PLOT'])
    dfdum = pd.DataFrame(columns=['PLOT', 'mDate', 'D100'])
    # read file in chunks to avoid RAM blowing up
    bucket = pd.read_csv(Location, iterator=True, chunksize=chunky, usecols=['PLOT', 'mDate', 'D100'])
    for chunk in bucket:
        store.append('wibble', chunk, format='table', data_columns=['PLOT', 'mDate', 'D100'], ignore_index=True)
    # retrieve plot numbers and select unique items
    plotty = store.select('wibble', "columns = ['PLOT']")
    plotty.drop_duplicates(inplace=True)
    # iterate through unique plots to retrieve data and put in dataframe for output
    for index, row in plotty.iterrows():
        dfdum = store.select('wibble', [Term('PLOT', '=', plotty.iloc[index]['PLOT'])])
        # process dfdum for output to new csv
    print("successful completion")

filesport()
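
For comparison, here is a minimal sketch of the same per-plot lookup written without Term, using `select_column` to get the plot numbers and a string `where` expression to pull each plot's rows. It is not a confirmed fix for the error above. It assumes a pandas version that accepts string `where` expressions and has PyTables installed. The sample frames, the integer-coded `mDate` values, and the temporary file name `blarg_test.h5` are made up for illustration and stand in for the chunks read from CL_short.csv. The store is opened with `mode='w'` so that it starts empty; whether a table left over from an earlier run plays any part in the error is something I have not verified.

```python
import os
import tempfile

import pandas as pd

# Made-up sample data standing in for the chunks read from CL_short.csv.
chunks = [
    pd.DataFrame({'PLOT': [1, 1], 'mDate': [20140101, 20140102], 'D100': [0.5, 0.7]}),
    pd.DataFrame({'PLOT': [2, 1], 'mDate': [20140101, 20140103], 'D100': [1.1, 0.9]}),
]

# Hypothetical scratch file so the real blarg.h5 is left untouched.
path = os.path.join(tempfile.mkdtemp(), 'blarg_test.h5')

# mode='w' creates a fresh file, so nothing from a previous run is reused.
store = pd.HDFStore(path, mode='w')
for chunk in chunks:
    # data_columns makes PLOT, mDate and D100 queryable on disk.
    store.append('wibble', chunk, format='table',
                 data_columns=['PLOT', 'mDate', 'D100'])

# Read only the PLOT column and reduce it to its unique values.
plots = store.select_column('wibble', 'PLOT').unique()

# Pull the rows for each plot with a string where expression.
per_plot = {}
for p in plots:
    per_plot[int(p)] = store.select('wibble', where='PLOT == %d' % p)

store.close()
```

Each entry of `per_plot` is a DataFrame holding only the rows for one plot, which is what the loop over `plotty` in the code above is trying to produce.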