python - Creating a Pandas DataFrame from os.walk()

Question

I am trying to create a DataFrame from the output of os.walk(). Here is an example of my folder structure.

Top Folder1
 ---File1

Top Folder2
 ---File2
 ---File3
 ---File4

I would like to build a DataFrame like this:

   Path          File_Name
0  Folder1_Path   File1
1  Folder2_Path   File2
2  Folder2_Path   File3
3  Folder2_Path   File4

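I can get the folder paths and the file names, but I cannot find a way to combine them into a DataFrame. I have tried concat and append into an empty DataFrame to no avail, and I even tried creating several Series and putting them into a single DataFrame.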

import pandas as pd 
import os
import os.path

for root,dirs,files in os.walk('Y:\\', topdown=True):
    if len(files) > 0:
        print(root) #Gets the Folder Path
        print("---", files) #Creates a List of the files

How do I get root into one column of the DataFrame and files into another?

score 2 · Accepted Answer

I would do something like this:

import os
import pandas as pd

res = []
for root, dirs, files in os.walk('Y:\\', topdown=True):
    if len(files) > 0:
        res.extend(list(zip([root]*len(files), files)))

df = pd.DataFrame(res, columns=['Path', 'File_Name']).set_index('Path')

Edit: Actually, I don't think you need list() around zip. Both versions should work: res.extend(zip([root]*len(files), files))
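To try the approach without access to the Y:\ drive, here is a minimal sketch that builds a throwaway directory tree with tempfile and runs the same loop over it. The folder and file names and the helper name walk_to_frame are hypothetical. set_index is left out so that Path stays an ordinary column, as in the desired output shown in the question.

```python
import os
import tempfile

import pandas as pd


def walk_to_frame(top):
    """Return one row per file under `top` (hypothetical helper name)."""
    res = []
    for root, dirs, files in os.walk(top, topdown=True):
        # zip pairs the repeated folder path with each file name;
        # folders without files add nothing to the list
        res.extend(zip([root] * len(files), files))
    return pd.DataFrame(res, columns=['Path', 'File_Name'])


with tempfile.TemporaryDirectory() as top:
    # Hypothetical layout mirroring the example in the question
    layout = {'Folder1': ['File1'], 'Folder2': ['File2', 'File3', 'File4']}
    for folder, names in layout.items():
        os.makedirs(os.path.join(top, folder))
        for name in names:
            open(os.path.join(top, folder, name), 'w').close()

    df = walk_to_frame(top)
    # os.walk does not guarantee an ordering, so sort for a stable display
    df = df.sort_values(['Path', 'File_Name']).reset_index(drop=True)
    print(df)
```

The Path column holds the full temporary folder paths, so the exact strings will differ from run to run.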

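Explanation:

The DataFrame class accepts many types of input. One that is easy to understand is a list of tuples.

The length of each tuple becomes the number of columns in the final DataFrame. In addition, appending to or extending a list is very efficient when loops are involved.

Example: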

tuple1 = (1, 2)
tuple2 = (110, 230)
all_list = [tuple1, tuple2]
pd.DataFrame(all_list)
Out[4]: 
     0    1
0    1    2
1  110  230

You can keep appending to that list as needed:

for i in range(100):
    all_list.append((i, i))

pd.DataFrame(all_list)
Out[19]: 
       0    1
0      1    2
1    110  230
2      0    0
3      1    1
4      2    2
5      3    3
...

Since you know you are passing tuples of length 2, you can also pass column names:

pd.DataFrame(all_list, columns=['path', 'file']).head()
Out[21]: 
   path  file
0     1     2
1   110   230
2     0     0
3     1     1
4     2     2

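In the example you gave us, root is always a single value, while files can be a list of any size. Using zip, I create a tuple of length 2, (root, file), for each file in root. Since you don't know in advance how many files each root contains, you can use [root]*len(files) to repeat root until its length matches that of files: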

list(zip(["a"]*len(tuple1), tuple1))
Out[6]: 
[('a', 1), ('a', 2)]

Extending the result list with this simply adds those tuples to the result list.
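The same list of (root, file) tuples can also be built with a nested list comprehension, which avoids the [root]*len(files) step. This is an alternative sketch, not the code from the answer above; the function name list_files and the file name example.txt are placeholders chosen for illustration.

```python
import os
import tempfile

import pandas as pd


def list_files(top):
    """Hypothetical helper: one (Path, File_Name) row per file under `top`."""
    rows = [
        (root, name)
        for root, _dirs, files in os.walk(top)
        for name in files  # folders without files contribute no rows
    ]
    return pd.DataFrame(rows, columns=['Path', 'File_Name'])


# Demonstrate on a temporary folder containing a single placeholder file
with tempfile.TemporaryDirectory() as top:
    open(os.path.join(top, 'example.txt'), 'w').close()
    df = list_files(top)
    print(df)
```

Either form produces the same rows; the comprehension just pairs each file name with its folder directly instead of repeating root first.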
