python - python pandas beginner: multi-dimensional data analysis workflow (groupby+agg+plot)

Question

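I am new to pandas and am trying to learn how to work with my multi-dimensional data.

My data

Suppose my data is a large CSV with the columns ['A', 'B', 'C', 'D', 'E', 'F', 'G']. The data describes simulation results, where ['A', 'B', ..., 'F'] are the simulation parameters and 'G' is one of the outputs (the only output in this example!).

EDIT/UPDATE: As Boud suggested in the comments, let's generate some data that is compatible with mine: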

import pandas as pd
import itertools
import numpy as np

npData = np.zeros(5000, dtype=[('A','i4'),('B','f4'),('C','i4'), ('D', 'i4'), ('E', 'f4'), ('F', 'i4'), ('G', 'f4')])

A = [0,1,2,3,6] # param A: int
B = [1000.0, 10.000] # param B: float
C = [100,150,200,250,300] # param C: int
D = [10,15,20,25,30] # param D: int
E = [0.1, 0.3] # param E: float
F = [0,1,2,3,4,5,6,7,8,9] # param F = random-seed = int -> 10 runs per scenario

# some beta-distribution parameters for randomizing the results in column "G"
aDistParams = [ (6,1),
                (5,2),
                (4,3),
                (3,4),
                (2,5),
                (1,6),
                (1,7) ]

counter = 0
for i in itertools.product(A,B,C,D,E,F):
    npData[counter]['A'] = i[0]
    npData[counter]['B'] = i[1]
    npData[counter]['C'] = i[2]
    npData[counter]['D'] = i[3]
    npData[counter]['E'] = i[4]
    npData[counter]['F'] = i[5]

    np.random.seed(i[5])  # call seed(); assigning to it would overwrite the function instead of seeding
    npData[counter]['G'] = np.random.beta(a=aDistParams[i[0]][0], b=aDistParams[i[0]][1])
    counter += 1

data = pd.DataFrame(npData)
data = data.reindex(np.random.permutation(data.index)) # shuffle rows because my original data doesn't give any guarantees

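Because the parameters ['A', 'B', ..., 'F'] are generated as a Cartesian product (meaning: nested for-loops; known a priori), I want to use groupby to obtain every possible "simulation scenario" before analyzing the output.

The parameter 'F' describes multiple runs of each scenario (each scenario is defined by 'A', 'B', ..., 'E'; let's assume 'F' is the random seed), so my code becomes: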

grouped = data.groupby(['A','B','C','D','E'])
# -> every group defines one simulation scenario

grouped_agg = grouped.agg(({'G' : np.mean}))
# -> the mean of the simulation output in 'G' over 'F' is calculated for each group/scenario
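To check that the grouping really collapses the seed column, here is a minimal sketch on a much smaller grid. The frame `small`, its two parameters, and the deterministic 'G' values are assumptions made for illustration; they are not the data generated above.

```python
import itertools

import pandas as pd

# Hypothetical miniature grid: 2 values of A x 2 values of B x 3 seeds (F).
rows = [
    {"A": a, "B": b, "F": f, "G": float(a + b + f)}
    for a, b, f in itertools.product([0, 1], [10.0, 1000.0], [0, 1, 2])
]
small = pd.DataFrame(rows)

# One group per scenario (A, B); the mean is taken over the seed column F.
scenario_mean = small.groupby(["A", "B"])["G"].mean()

# Sanity checks: number of scenarios and number of runs in each scenario.
n_scenarios = len(scenario_mean)
runs_per_scenario = small.groupby(["A", "B"]).size()
```

Calling `.mean()` on the selected column computes the same per-scenario mean as the `agg` call above, but returns a Series indexed by the scenario parameters rather than a one-column DataFrame.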

What do I want to do now?

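I: Show all (unique) values of each scenario parameter within these groups -> grouped_agg gives me an iterable of tuples where, for example, the entries at position 0 are all the values of 'A' (so with a few lines of Python I could get my unique values, but maybe there is a function for this)
- Update: my approach
- list(set(grouped_agg.index.get_level_values('A'))) (when interested in 'A'; uses set to get the unique values; probably not what you want if you need high performance)
- => [0, 1, 2, 3, 6]
II: Generate some (low-dimensional) plots -> I need to hold some variables constant and filter/select my data before plotting (which is why I need step I) =>
- 'B' constant
- 'C' constant
- 'E' constant
- 'D' = x-axis
- 'G' = y-axis / my aggregated output
- 'A' = multi-dimension = multiple colors in a 2D plot -> one G/y-axis series for each value of 'A'
How would I generate such a plot?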

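I think reshaping my data is the key step, so that pandas' plotting capabilities can handle it. Perhaps a shape with 5 columns (one for each value of parameter A), holding the G values that correspond to each index selection and choice of A, would be enough, but I have not been able to achieve that form yet.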
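The shape described here (one column per value of 'A', indexed by 'D') is what `DataFrame.pivot` produces. The following is a minimal sketch on made-up numbers; the frame `df` and its 'G' values are hypothetical, and it assumes 'B', 'C', and 'E' have already been fixed so that each (D, A) pair occurs exactly once.

```python
import itertools

import pandas as pd

# Hypothetical aggregated data after B, C and E have been held constant.
rows = [
    {"A": a, "D": d, "G": a * 100 + d}
    for a, d in itertools.product([0, 1, 2], [10, 15, 20])
]
df = pd.DataFrame(rows)

# Wide layout: D becomes the index, each value of A becomes its own column.
wide = df.pivot(index="D", columns="A", values="G")
```

With the data in this layout, `wide.plot()` draws one line per column, that is, one line per value of 'A' against 'D'. Note that `pivot` raises an error if a (D, A) pair appears more than once, so the filtering step has to come first.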

Thanks for your input!

(I am using pandas 0.12 within Enthought Canopy)

Sascha

score 2 · Accepted Answer

I: If I understand your example and the desired output, I don't see why grouping is necessary.

data.A.unique()
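As a small illustration of point I, the sketch below compares `unique()` on a raw column with the index-level approach from the question. The frame `agg` and its index are made up for the example; only `get_level_values` and `unique` are taken from the discussion.

```python
import pandas as pd

# Hypothetical aggregated frame with a two-level index (A, B).
index = pd.MultiIndex.from_product(
    [[0, 1, 2, 3, 6], [10.0, 1000.0]], names=["A", "B"]
)
agg = pd.DataFrame({"G": range(len(index))}, index=index)

# Unique values of one index level, in order of first appearance.
unique_a = agg.index.get_level_values("A").unique().tolist()

# The same information from a plain column, after moving the index back to columns.
unique_a_from_column = agg.reset_index()["A"].unique().tolist()
```

Unlike `list(set(...))`, both variants preserve the order in which the values first appear.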

II: Updated...

I'll implement the example you sketched above. Suppose we have averaged 'G' over the random seed ('F'), like so:

data = data.groupby(['A','B','C','D','E']).agg(({'G' : np.mean})).reset_index()

First, select the rows where B, C, and E have the constant values you specify.

df1 = data[(data['B'] == const1) & (data['C'] == const2) & (data['E'] == const3)]
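Here is a minimal sketch of that selection with concrete constants. The frame `df` and the values of `b_const`, `c_const`, and `e_const` are assumptions chosen for illustration. Because 'B' and 'E' are stored as 32-bit floats in the question's data, the sketch uses `np.isclose` for those two columns, since exact equality on floating-point columns can be fragile.

```python
import numpy as np
import pandas as pd

# Hypothetical aggregated rows, using the same dtypes as the question's data.
df = pd.DataFrame({
    "B": np.array([10.0, 10.0, 1000.0, 1000.0], dtype="f4"),
    "C": np.array([100, 150, 100, 150], dtype="i4"),
    "E": np.array([0.1, 0.3, 0.1, 0.1], dtype="f4"),
    "G": [0.5, 0.6, 0.7, 0.8],
})

# Illustrative constants; pick whichever scenario slice you want to plot.
b_const, c_const, e_const = 1000.0, 100, 0.1

# Tolerant comparison for the float columns, exact comparison for the int column.
mask = (
    np.isclose(df["B"], b_const)
    & (df["C"] == c_const)
    & np.isclose(df["E"], e_const)
)
selected = df[mask]
```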

Now we want to plot 'G' as a function of 'D', using a different color for each value of 'A'.

df1.set_index('D').groupby('A')['G'].plot(legend=True)
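If you want more control over labels and axes than the one-liner gives, the same plot can be written as an explicit loop over the groups. This is only a sketch: the contents of `df1` are made up, and the `Agg` backend is selected so that it runs without a display.

```python
import itertools

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen plots
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical filtered data: B, C and E are already held constant.
rows = [
    {"A": a, "D": d, "G": a + d / 100.0}
    for a, d in itertools.product([0, 1, 2], [10, 15, 20])
]
df1 = pd.DataFrame(rows)

fig, ax = plt.subplots()
# One line per value of A; matplotlib cycles through colors automatically.
for a_value, sub in df1.groupby("A"):
    sub = sub.sort_values("D")  # make sure the x values are in increasing order
    ax.plot(sub["D"], sub["G"], marker="o", label="A = {}".format(a_value))

ax.set_xlabel("D")
ax.set_ylabel("mean G")
ax.legend()

line_labels = [line.get_label() for line in ax.get_lines()]
```

Sorting by 'D' matters when the rows have been shuffled, as in the question; otherwise the line would zigzag between points.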

I tested the above on some dummy data, and it works as you describe. The values of 'G' corresponding to each 'A' are plotted in a distinct color on the same axes.

III: I don't know how to answer that broad question.

IV: No, I don't think that's an issue for you here.

I suggest playing with simpler, small data sets and getting more familiar with pandas.
