
I'm using pandas to do an outer merge on a set of about 1000-2000 CSV files. Each CSV file has an identifier column id which is shared between all the CSV files, but each file has its own unique set of 3-5 columns. There are roughly 20,000 unique id rows in each file. All I want to do is merge these together, bringing all the new columns together and using the id column as the merge key.

I do it using a simple merge call in a loop:

merged_df = first_df  # dataframe from the first CSV file
for next_filename in filenames:
    # load the next CSV into a dataframe
    next_df = pd.read_csv(next_filename)
    merged_df = merged_df.merge(next_df, on=["id"], how="outer")

The problem is that with nearly 2000 CSV files, pandas throws a MemoryError in the merge operation. I'm not sure if this is a limitation of the merge operation itself?

The final dataframe would have 20,000 rows and roughly 2000 x 3 = 6000 columns. At 8 bytes per float, that is on the order of 20,000 x 6,000 x 8 bytes, or about 1 GB. This is large, but not large enough to consume all the memory on the computer I am using, which has over 20 GB of RAM. Is this size too much for pandas to manipulate? Should I be using something like sqlite instead? Is there something I can change in the merge operation to make it work at this scale?

Thanks.


3 Answers


I think you'll get better performance using a concat (which acts like an outer join):

dfs = (pd.read_csv(filename).set_index('id') for filename in filenames)
merged_df = pd.concat(dfs, axis=1)

This means you are doing only one merge operation rather than one for each file.
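One caveat (my addition, based on how pandas aligns along axis=1): the column-wise concat aligns rows on the index, so the id values within each file need to be unique. A quick sanity check while loading might look like this (load_indexed is a hypothetical helper name):

import pandas as pd

def load_indexed(filename):
    # read one CSV and index it by 'id' so concat can align on it
    df = pd.read_csv(filename).set_index('id')
    # duplicate ids within a single file would blow up the row count
    # of the axis=1 concat, so fail early
    if not df.index.is_unique:
        raise ValueError("duplicate id values in %s" % filename)
    return df

dfs = (load_indexed(filename) for filename in filenames)
merged_df = pd.concat(dfs, axis=1)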

answered 2013-06-19T19:04:37.117

I met the same error in 32-bit Python when using read_csv with a 1 GB file. A 32-bit process can only address a few GB of memory no matter how much RAM the machine has. Try the 64-bit version and hopefully it will solve the MemoryError problem.
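To check which build you are running (my addition; both calls are in the standard library):

import platform
import struct

# pointer size in bits: 32 for a 32-bit interpreter, 64 for 64-bit
print(struct.calcsize("P") * 8)
print(platform.architecture()[0])  # e.g. '64bit'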

answered 2014-12-18T07:01:57.123

pd.concat seems to run out of memory for large dataframes as well; one option is to convert the dataframes to matrices and concat those.

import numpy as np
import pandas as pd

def concat_df_by_np(df1, df2):
    """
    Accepts two dataframes, converts each to a NumPy array, concats them
    horizontally and uses the index of the first dataframe. This is not a
    concat by index but simply by position, therefore the index of both
    dataframes should be the same.
    """
    if not df1.index.equals(df2.index):
        # logging.warning('Indices in concat_df_by_np are not the same')
        print('Indices in concat_df_by_np are not the same')
    # .values returns the underlying array (.as_matrix() was removed
    # from newer pandas versions)
    dfout = pd.DataFrame(np.concatenate((df1.values, df2.values), axis=1),
                         index=df1.index,
                         columns=np.concatenate([df1.columns, df2.columns]))
    return dfout

However, one needs to be careful, as this function is not a join but a horizontal append in which the indices are ignored.
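Applied to the question's 2000 files, the function could be folded over the dataframes like this (my sketch; sorting by id makes the rows line up positionally, and it assumes every file contains the same set of ids):

from functools import reduce

import pandas as pd

# index by 'id' and sort so rows line up positionally across files
dfs = (pd.read_csv(filename).set_index('id').sort_index()
       for filename in filenames)
merged_df = reduce(concat_df_by_np, dfs)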

answered 2017-03-30T13:33:14.990