python - Efficient way to perform permutations with pandas on a large DataFrame

Question

I currently have a pandas DataFrame like this:

 ID                    A1      A2       A3       B1       B2       B3
 Ku8QhfS0n_hIOABXuE    6.343   6.304    6.410    6.287    6.403    6.279
 fqPEquJRRlSVSfL.8A    6.752   6.681    6.680    6.677    6.525    6.739
 ckiehnugOno9d7vf1Q    6.297   6.248    6.524    6.382    6.316    6.453
 x57Vw5B5Fbt5JUnQkI    6.268   6.451    6.379    6.371    6.458    6.333

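This DataFrame is used for a statistical analysis, after which a permutation test is required (edit: to be precise, a random permutation). The index of each column needs to be shuffled (sampled) 100 times. To give an idea of the size, the number of rows can be around 50,000.

EDIT: The permutation is along the rows, i.e. the index of each column is shuffled.

The biggest issue here is performance. I would like to perform the permutations as quickly as possible.

An example I came up with: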

import random
import joblib

def permutation(dataframe):
    return dataframe.apply(random.sample, axis=1, k=len(dataframe))

permute = joblib.delayed(permutation)
pool = joblib.Parallel(n_jobs=-2) # all cores minus 1
result = pool(permute(dataframe) for item in range(100))

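The problem is that with this approach the test is unstable: the permutation apparently works, but it is not as "random" as when it is done without parallelism, so the results lose stability when I use the permuted data in subsequent calculations.

So my only "solution" was to precompute all the indices for all columns before running the parallel code, which slows things down considerably.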
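
One possible cause, which the question does not confirm, is that the worker processes end up with identical or correlated random state. A minimal sketch of one way to avoid that is to derive a separate NumPy generator for every permutation from a single `SeedSequence`. The names `permute_once` and `independent_permutations` are hypothetical, and the small `demo` frame only stands in for the real data:

```python
import numpy as np
import pandas as pd


def permute_once(frame, seed_seq):
    """Shuffle each column with a generator private to this task."""
    rng = np.random.default_rng(seed_seq)
    shuffled = {col: rng.permutation(frame[col].to_numpy())
                for col in frame.columns}
    return pd.DataFrame(shuffled, index=frame.index)


def independent_permutations(frame, nperm, seed=1234567890):
    """Return nperm permuted copies, each driven by its own child seed."""
    # spawn() derives nperm independent child seeds from one parent seed
    children = np.random.SeedSequence(seed).spawn(nperm)
    # each (frame, child) pair could instead be handed to a worker process
    return [permute_once(frame, child) for child in children]


demo = pd.DataFrame({"A1": [6.343, 6.752, 6.297, 6.268],
                     "B1": [6.287, 6.677, 6.382, 6.371]})
results = independent_permutations(demo, nperm=100)
```

Because each task receives its own child seed, the result should not depend on which process runs which task, and the whole run stays reproducible from the single parent seed.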

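My questions are:

Is there a more efficient way to do this kind of permutation? (Not necessarily in parallel.)
Is the parallel approach (using multiple processes, not threads) feasible?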

EDIT: To make things clearer, this is what should happen after one shuffle, for example to column A1:

Ku8QhfS0n_hIOABXuE    6.268   
fqPEquJRRlSVSfL.8A    6.343
ckiehnugOno9d7vf1Q    6.752
x57Vw5B5Fbt5JUnQkI    6.297

(i.e. the values move between rows).

EDIT2: This is what I am using now:

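import random
import pandas

def _generate_indices(indices, columns, nperm):
    # indices: array of row positions, e.g. numpy.arange(len(data))

    random.seed(1234567890)
    num_genes = indices.size

    for item in range(nperm):

        # one independently resampled index column per data column
        permuted = pandas.DataFrame(
            {column: random.sample(list(indices), num_genes) for column in columns},
            index=range(num_genes)
        )

        yield permuted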

(In short, this builds a DataFrame of resampled indices, one column of indices per data column.)

Later (yes, I know this is ugly):

# data is the original DataFrame
# indices is one of the results of that generator

permuted = dict()

for column in data.columns:
    # indices hold row positions, so index the underlying array positionally
    value = data[column]
    permuted[column] = value.values[indices[column].values]

permuted_table = pandas.DataFrame(permuted, index=data.index)
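
The two steps above (draw one index permutation per column, then reorder each column by it) can also be applied to the underlying array in a single pass. The following is only a sketch under the assumption that all columns share one numeric dtype. `permute_by_index` is a hypothetical name, and `np.take_along_axis` requires NumPy 1.15 or later:

```python
import numpy as np
import pandas as pd


def permute_by_index(data, rng):
    """Reorder every column with its own random row order."""
    values = data.to_numpy()
    # argsort of random keys yields one independent permutation per column
    order = rng.random(values.shape).argsort(axis=0)
    # pick values[order[i, j], j] for every cell
    shuffled = np.take_along_axis(values, order, axis=0)
    return pd.DataFrame(shuffled, index=data.index, columns=data.columns)


rng = np.random.default_rng(1234567890)
data = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=["A1", "A2", "A3"])
permuted_table = permute_by_index(data, rng)
```

Sorting random keys costs more per column than a plain shuffle, but it avoids the Python-level loop over columns and keeps the index matrix available if it is needed later.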

score 1 · Accepted Answer

How about this:

In [1]: import numpy as np; import pandas as pd

In [2]: df = pd.DataFrame(np.random.randn(50000, 10))

In [3]: def shuffle(df, n):
   ....:     for i in range(n):
   ....:         # shuffles whole rows in place; relies on df.values being a writable view
   ....:         np.random.shuffle(df.values)
   ....:     return df


In [4]: df.head()
Out[4]:
          0         1         2         3         4         5         6         7         8         9
0  0.329588 -0.513814 -1.267923  0.691889 -0.319635 -1.468145 -0.441789  0.004142 -0.362073 -0.555779
1  0.495670  2.460727  1.174324  1.115692  1.214057 -0.843138  0.217075  0.495385  1.568166  0.252299
2 -0.898075  0.994281 -0.281349 -0.104684 -1.686646  0.651502 -1.466679 -1.256705  1.354484  0.626840
3  1.158388 -1.227794 -0.462005 -1.790205  0.399956 -1.631035 -1.707944 -1.126572 -0.892759  1.396455
4 -0.049915  0.006599 -1.099983  0.775028 -0.694906 -1.376802 -0.152225  1.413212  0.050213 -0.209760

In [5]: shuffle(df, 1).head(5)
Out[5]:
          0         1         2         3         4         5         6         7         8         9
0  2.044131  0.072214 -0.304449  0.201148  1.462055  0.538476 -0.059249 -0.133299  2.925301  0.529678
1  0.036957  0.214003 -1.042905 -0.029864  1.616543  0.840719  0.104798 -0.766586 -0.723782 -0.088239
2 -0.025621  0.657951  1.132175 -0.815403  0.548210 -0.029291  0.575587  0.032481 -0.261873  0.010381
3  1.396024  0.859455 -1.514801  0.353378  1.790324  0.286164 -0.765518  1.363027 -0.868599 -0.082818
4 -0.026649 -0.090119 -2.289810 -0.701342 -0.116262 -0.674597 -0.580760 -0.895089 -0.663331  0.

In [6]: %timeit shuffle(df, 100)
Out[6]:
1 loops, best of 3: 14.4 s per loop

This does what you need. The only question is whether it is fast enough.

Update

Based on @Einar's comment, I have changed my solution.

In [7]: def shuffle2(df, n):
   ....:     ind = df.index
   ....:     for i in range(n):
   ....:         sampler = np.random.permutation(df.shape[0])
   ....:         new_vals = df.take(sampler).values
   ....:         df = pd.DataFrame(new_vals, index=ind)
   ....:     return df

In [8]: df.head()
Out[8]: 
          0         1         2         3         4         5         6         7         8         9
0 -0.175006 -0.462306  0.565517 -0.309398  1.100570  0.656627  1.207535 -0.221079 -0.933068 -0.192759
1  0.388165  0.155480 -0.015188  0.868497  1.102662 -0.571818 -0.994005  0.600943  2.205520 -0.294121
2  0.281605 -1.637529  2.238149  0.987409 -1.979691 -0.040130  1.121140  1.190092 -0.118919  0.790367
3  1.054509  0.395444  1.239756 -0.439000  0.146727 -1.705972  0.627053 -0.547096 -0.818094 -0.056983
4  0.209031 -0.233167 -1.900261 -0.678022 -0.064092 -1.562976 -1.516468  0.512461  1.058758 -0.206019

In [9]: shuffle2(df, 1).head()
Out[9]: 
          0         1         2         3         4         5         6         7         8         9
0  0.054355  0.129432 -0.805284 -1.713622 -0.610555 -0.874039 -0.840880  0.593901  0.182513 -1.981521
1  0.624562  1.097495 -0.428710 -0.133220  0.675428  0.892044  0.752593 -0.702470  0.272386 -0.193440
2  0.763551 -0.505923  0.206675  0.561456  0.441514 -0.743498 -1.462773 -0.061210 -0.435449 -2.677681
3  1.149586 -0.003552  2.496176 -0.089767  0.246546 -1.333184  0.524872 -0.527519  0.492978 -0.829365
4 -1.893188  0.728737  0.361983 -0.188709 -0.809291  2.093554  0.396242  0.402482  1.884082  1.373781

In [10]: timeit shuffle2(df, 100)
1 loops, best of 3: 2.47 s per loop
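
Note that both `shuffle` and `shuffle2` move whole rows together, so the values within a row stay aligned across columns. If each column has to be shuffled independently, as the question's edit suggests, one option is `Generator.permuted` (NumPy 1.20 or later). This is a sketch rather than a benchmarked replacement; `shuffle_columns` is a hypothetical name and no timing is claimed for it:

```python
import numpy as np
import pandas as pd


def shuffle_columns(df, n, seed=None):
    """Yield n copies of df with every column shuffled independently."""
    rng = np.random.default_rng(seed)
    values = df.to_numpy()
    for _ in range(n):
        # axis=0 shuffles the values within each column, separately per column
        yield pd.DataFrame(rng.permuted(values, axis=0),
                           index=df.index, columns=df.columns)


df = pd.DataFrame(np.random.default_rng(0).standard_normal((50000, 10)))
first = next(shuffle_columns(df, 100, seed=1))
```

Writing it as a generator means only one permuted copy has to be held in memory at a time, which may matter with 100 permutations of a 50,000-row frame.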
