python - Create a pandas DataFrame from multiple files

Question

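I am trying to create a pandas DataFrame, and it works fine for a single file. What if I need to build it from multiple files that share the same data structure? Instead of a single file name, I have a list of file names from which I would like to create the DataFrame.

I'm not sure what the way is to append to the current DataFrame in pandas, or whether pandas has a way to suck a list of files into a DataFrame.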

score 36 · Accepted Answer

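The pandas concat command is your friend. Let's say you have all your files in a directory, targetdir. You can:

list the files
load them as pandas DataFrames
and concatenate them together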

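import os
import pandas as pd

# directory that holds the files (placeholder value; point it at your own data)
targetdir = "."
# list the files; os.listdir returns bare names, so join them with the directory
filelist = [os.path.join(targetdir, name) for name in os.listdir(targetdir)]
# read them into pandas (read_table expects tab-separated data by default)
df_list = [pd.read_table(path) for path in filelist]
# concatenate them together
big_df = pd.concat(df_list)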
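
A variation on the same three steps, as a minimal sketch: it assumes the files are CSVs with identical columns, and the helper name load_many and the "*.csv" pattern are made up for illustration. Passing ignore_index=True gives the combined frame a fresh row index instead of repeating each file's own row numbers.

```python
import glob
import os

import pandas as pd


def load_many(targetdir, pattern="*.csv"):
    """Read every file in targetdir matching pattern into one DataFrame."""
    # sorted() makes the row order reproducible across runs
    paths = sorted(glob.glob(os.path.join(targetdir, pattern)))
    if not paths:
        raise FileNotFoundError(f"no files matching {pattern!r} in {targetdir!r}")
    frames = [pd.read_csv(path) for path in paths]
    # ignore_index=True renumbers the rows instead of keeping each file's index
    return pd.concat(frames, ignore_index=True)
```

Filtering by pattern also keeps stray non-data files in the directory from being passed to the reader.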

score 3 · Accepted Answer

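Potentially horribly inefficient, but...

Why not use read_csv to build two (or more) DataFrames, then use join to put them together?

That said, it would be easier to answer your question if you provided some data or some of the code you've used so far.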

score 1 · Accepted Answer

I might try to concatenate the files before feeding them to pandas. If you're on Linux or Mac you could use cat; otherwise a very simple Python function could do the job for you.
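
Such a function might look like the sketch below. It assumes every file is plain text with exactly one header line that is the same in all files, and the name concat_files is hypothetical. Only the first file's header is kept, which a plain cat would not do.

```python
def concat_files(paths, out_path):
    """Write the contents of paths into out_path, keeping one header line."""
    with open(out_path, "w", newline="") as out:
        for i, path in enumerate(paths):
            with open(path, newline="") as src:
                header = src.readline()
                if i == 0 and header:
                    # keep the header from the first file only
                    out.write(header if header.endswith("\n") else header + "\n")
                for line in src:
                    # a file without a trailing newline must not glue its
                    # last row onto the first row of the next file
                    out.write(line if line.endswith("\n") else line + "\n")
```

The combined file can then be loaded with a single read_csv call.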

score 0 · Accepted Answer

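Are the files in CSV format? If so, you can use read_csv. http://pandas.sourceforge.net/io.html

Once you have read the files and saved them in two DataFrames, you can merge the two DataFrames or add additional columns to one of them (assuming a common index). pandas should be able to fill in the missing rows.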
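
To illustrate the merge on a common index, here is a small sketch with made-up data; the column names and index labels are invented. An outer join on the index keeps rows that appear in only one frame and fills the gaps with NaN.

```python
import pandas as pd

# two frames that share an index but cover different rows
left = pd.DataFrame({"a": [1, 2, 3]}, index=["r1", "r2", "r3"])
right = pd.DataFrame({"b": [10.0, 30.0]}, index=["r1", "r3"])

# align on the shared index; how="outer" keeps rows missing from either side
combined = left.join(right, how="outer")
print(combined)
```

Note that this places the data side by side as extra columns. For files with the same columns that should be stacked as extra rows, pd.concat is likely the better fit.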

score 0 · Accepted Answer

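import os
import pandas as pd

data = []

thisdir = os.getcwd()

# walk the current directory tree and collect the names of .docx files
for root, dirs, files in os.walk(thisdir):
    for file in files:
        if file.endswith(".docx"):
            data.append(file)

# one-column DataFrame of the file names (not of the file contents)
df = pd.DataFrame(data)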

score 0 · Accepted Answer

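Here is a simple solution that avoids using a list to hold all the DataFrames, in case you don't need them in a list. It creates a DataFrame for each file, and you can then pd.concat them.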

import fnmatch
import os

import pandas as pd

# get the CSV files only
files = fnmatch.filter(os.listdir('.'), '*.csv')
files

The output is now a list of the names:

['Feedback Form Submissions 1.21-1.25.22.csv',
 'Feedback Form Submissions 1.21.22.csv',
 'Feedback Form Submissions 1.25-1.31.22.csv']

Now create a simple list of new names to make working with them easier:

# use a simple format: data0, data1, ...
names = ['data' + str(i) for i in range(len(files))]
names

['data0', 'data1', 'data2']

You can use any list of names you want. The next step takes the file names and the list of names, then assigns each DataFrame to its name.

# pair each file name with its new variable name
for file, new_name in zip(files, names):
    # load the file into a dataframe
    df = pd.read_csv(file, low_memory=False)
    # bind the dataframe to a module-level variable called new_name;
    # globals() is used because writing to locals() has no effect inside a function
    globals()[new_name] = df

You now have 3 separate DataFrames named data0, data1 and data2, and can run commands such as:

data2.info()
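
If creating variables dynamically feels fragile, a dictionary keyed by name is a common alternative. The following is a sketch rather than part of the answer above; the helper name load_named and its default arguments are assumptions.

```python
import fnmatch
import os

import pandas as pd


def load_named(directory=".", pattern="*.csv", prefix="data"):
    """Return {"data0": DataFrame, "data1": DataFrame, ...} for matching files."""
    # sorted() fixes which file ends up under which name
    files = sorted(fnmatch.filter(os.listdir(directory), pattern))
    return {
        f"{prefix}{i}": pd.read_csv(os.path.join(directory, name), low_memory=False)
        for i, name in enumerate(files)
    }
```

With frames = load_named(), the call frames["data2"].info() plays the role of data2.info(), and pd.concat(list(frames.values())) stacks everything into one DataFrame.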
