Context
I am running a simulation that generates multiple (networkx) graphs. (It is actually a mesa agent-based simulation, inspired by the virus-on-a-network example.) Because of the randomness involved, I run each set of parameters several times.
Minimal example
The following example should give you an idea of what I am trying to achieve:
import numpy as np
import pandas as pd
import networkx as nx


def avg_degree(G):
    return 2 * G.number_of_edges() / G.number_of_nodes()


def degree_distribution(G):
    return np.array(nx.degree_histogram(G))


def network_metrics(G):
    return {
        "avg degree": avg_degree(G),
        "degree distribution": degree_distribution(G),
    }


def generate_data(step, run, n, p):
    G = nx.erdos_renyi_graph(n, p)
    dct = {
        'network_type': 'random',
        'run': run,
        'nb_agents': n,
        'probability': p,
        'infected': np.random.randint(0, 100, step),
    }
    dct.update(network_metrics(G))
    return dct


def main(nb_steps):
    lst = []
    for run in range(4):
        for nb_nodes in [10, 20, 50]:
            for probability in [0.1, 0.5, 0.8]:
                dct = generate_data(nb_steps, run, nb_nodes, probability)
                lst.append(dct)
    result = pd.DataFrame(lst)
    indexes = ['network_type', 'run', 'nb_agents', 'probability']
    result.set_index(indexes, inplace=True, drop=True)
    return result
This gives:
result = main(10)
result.head()
network_type | run | nb_agents | probability | infected | avg degree | degree distribution |
---|---|---|---|---|---|---|
random | 0 | 10 | 0.1 | [73 86 96 94 33 57 36 15 30 74] | 0.8 | [5 3 1 1] |
random | 0 | 10 | 0.5 | [ 4 0 64 37 40 16 30 67 51 36] | 4.2 | [0 0 0 4 2 2 2] |
random | 0 | 10 | 0.8 | [59 96 51 68 81 11 40 31 26 95] | 7.2 | [0 0 0 0 0 0 1 6 3] |
random | 0 | 20 | 0.1 | [17 91 26 32 63 65 79 28 80 32] | 1.8 | [3 4 9 2 2] |
random | 0 | 20 | 0.5 | [17 2 5 17 85 13 42 77 70 72] | 9 | [0 0 0 0 0 0 2 2 4 3 6 1 2] |
Goal
- Explode `infected` and `degree distribution` (!! note: `infected` has a fixed length (= `nb_steps`), but `degree distribution` does not; its length varies).
- Merge everything into a single dataset.
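The explode step can be sketched on a toy frame (toy data, with column names borrowed from the example above): each list cell becomes its own row, and a cumcount over the repeated original index recovers the position within each list.

```python
import pandas as pd

# Toy illustration of the "explode" goal: each list cell becomes one
# row, and a cumcount over the (repeated) original index recovers the
# position inside the list.
df = pd.DataFrame({
    "run": [0, 1],
    "degree distribution": [[5, 3, 1], [0, 4]],
})
exploded = df.explode("degree distribution")
# explode keeps the original row index, so grouping on it numbers the
# entries of each list 0, 1, 2, ...
exploded["nb_nodes_with_degree"] = exploded.groupby(level=0).cumcount()
```

This is the same groupby-on-index trick the helper below uses.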
Current solution
I use the following helper functions to explode the different columns:
def explode(df, mapping):
    new_df = df.reset_index()
    for col, idx in mapping.items():
        new_df.index.rename('_id', inplace=True)
        new_df = new_df.explode(col)
        new_df.insert(1, idx, new_df.groupby('_id').cumcount())
        new_df.reset_index(drop=True, inplace=True)
    idx = list(df.index.names)
    nested_idx = list(mapping.values())
    return new_df.set_index(idx + nested_idx)


def helper(df, mapping):
    sol = []
    for k, v in mapping.items():
        sol.append(explode(df[k], {k: v}).to_xarray())
    leftover_columns = list(df.columns.difference(mapping.keys()))
    sol.append(df[leftover_columns].to_xarray())
    return sol
which I use as follows:
import xarray as xr

mapping = {'infected': 'step', 'degree distribution': "nb_nodes_with_degree"}
lst = helper(result, mapping)
xr.combine_by_coords(lst)
Result
>>> xr.combine_by_coords(lst) # final result
<xarray.Dataset>
Dimensions: (network_type: 1, run: 4, nb_agents: 3, probability: 3, nb_nodes_with_degree: 47, step: 10)
Coordinates:
* network_type (network_type) object 'random'
* run (run) int64 0 1 2 3
* nb_agents (nb_agents) int64 10 20 50
* probability (probability) float64 0.1 0.5 0.8
* nb_nodes_with_degree (nb_nodes_with_degree) int64 0 1 2 3 4 ... 43 44 45 46
* step (step) int64 0 1 2 3 4 5 6 7 8 9
Data variables:
avg degree (network_type, run, nb_agents, probability) float64 ...
degree distribution (network_type, run, nb_agents, probability, nb_nodes_with_degree) object ...
infected (network_type, run, nb_agents, probability, step) object ...
Limitations
It works, but it is slow and quite inelegant. --> Is there a better way to handle this?
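One possible direction (a rough sketch on toy data, not benchmarked, so treat it as a hypothesis rather than a verified speed-up): since `infected` already has a fixed length, only `degree distribution` is ragged. Padding each histogram with NaN up to the longest one makes the data rectangular, so it can be handed to xarray in one step instead of being exploded row by row.

```python
import numpy as np
import xarray as xr

def pad_to(arr, length):
    # Right-pad a 1-D array with NaN so every histogram has equal length.
    out = np.full(length, np.nan)
    out[:len(arr)] = arr
    return out

# Toy stand-ins for two runs' degree histograms of different lengths.
dists = [np.array([5, 3, 1]), np.array([0, 4])]
max_len = max(len(d) for d in dists)
stacked = np.stack([pad_to(d, max_len) for d in dists])

# Build the DataArray directly; in the real code the leading dimension
# would be the full (network_type, run, nb_agents, probability) index.
da = xr.DataArray(
    stacked,
    dims=("run", "nb_nodes_with_degree"),
    coords={"run": [0, 1], "nb_nodes_with_degree": list(range(max_len))},
)
```

The NaN padding makes the missing-degree cells explicit, which matches what `combine_by_coords` produces implicitly when it aligns histograms of different lengths.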