python - Alternative to looping over pandas DataFrame rows to apply conditions?

Question

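I have a DataFrame that I want to modify based on certain conditions. The actual DataFrame (35k rows, 400 columns) is much larger than the example below and has many more patient columns.

If a given row has 2 NaNs under the patient columns, I want to drop the entire row. Next, I want to append a column to the DataFrame containing the df.std() of all patient values in each row. I have read that iterating over a pandas DataFrame is not recommended, but I am struggling to do this with numpy.

Input: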

In [1]: df=pd.DataFrame({'chromosome':[1,1,5,4], 
   ...:                  'strand':['-','-','+','-'], 
   ...:                  'elementloc':[4991, 8870, 2703, 9674], 
   ...:                  'Patient1_Psi': ['NaN', 0.25,0.63,0.92], 
   ...:                  'Patient2_Psi':[0.11, 0.45, 'NaN', 1.0], 
   ...:                  'Patient3_Psi':['NaN', 0.1, 'NaN', 0.4]}) 
   ...: df  

                                                                

Out[2]: 
   chromosome strand  elementloc Patient1_Psi Patient2_Psi Patient3_Psi
0           1      -        4991          NaN         0.11          NaN
1           1      -        8870         0.25         0.45          0.1
2           5      +        2703         0.63          NaN          NaN
3           4      -        9674         0.92            1          0.4

Desired output:

In [3]: df_new=pd.DataFrame({'chromosome':[1,4], 
   ...:                  'strand':['-','-'], 
   ...:                  'elementloc':[ 8870, 9674], 
   ...:                  'Patient1_Psi': [0.25,0.92], 
   ...:                  'Patient2_Psi':[0.45, 1.0], 
   ...:                  'Patient3_Psi':[0.1, 0.4], 
   ...:                   'std':[0.175594, 0.325781]}) 
   ...: df_new                                                                 


Out[4]: 
   chromosome strand  elementloc  Patient1_Psi  Patient2_Psi  Patient3_Psi       std
0           1      -        8870          0.25          0.45           0.1  0.175594
1           4      -        9674          0.92          1.00           0.4  0.325781

Any suggestions?

score 1 · Accepted Answer

You can do it like this, using filter to select the columns whose names match a pattern:

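import numpy as np

# Turn the 'NaN' strings into real NaN, drop rows with a NaN in any Patient column, then add the row-wise std
df = df.replace('NaN', np.nan)
df_new = df[~df.filter(like='Patient').isna().any(axis=1)]
pd.concat([df_new, df_new.filter(like='Patient').std(axis=1).rename('std')], axis=1)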

Output:

   chromosome strand  elementloc  Patient1_Psi  Patient2_Psi  Patient3_Psi       std
1           1      -        8870          0.25          0.45           0.1  0.175594
3           4      -        9674          0.92          1.00           0.4  0.325781
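
For readers who want to run this end to end, here is a minimal self-contained sketch of the same approach on the question's sample data. It keeps the answer's rule (a row is dropped if any patient column is missing). Converting the columns with pd.to_numeric instead of replace is a substitution made here so that they end up with a float dtype, and the names patient_cols and mask are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'chromosome': [1, 1, 5, 4],
    'strand': ['-', '-', '+', '-'],
    'elementloc': [4991, 8870, 2703, 9674],
    'Patient1_Psi': ['NaN', 0.25, 0.63, 0.92],
    'Patient2_Psi': [0.11, 0.45, 'NaN', 1.0],
    'Patient3_Psi': ['NaN', 0.1, 'NaN', 0.4],
})

# Labels of every column whose name contains "Patient".
patient_cols = df.filter(like='Patient').columns

# Coerce the patient columns to float; the 'NaN' strings become real NaN.
df[patient_cols] = df[patient_cols].apply(pd.to_numeric, errors='coerce')

# Keep rows with no missing patient value, then add the row-wise std.
mask = df[patient_cols].notna().all(axis=1)
df_new = df.loc[mask].assign(std=lambda d: d[patient_cols].std(axis=1))
print(df_new)
```

Everything here operates on whole columns, so no explicit loop over rows is needed.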

score 1 · Accepted Answer

You can do this in a single line by translating your requirement directly into pandas:

df[(df.loc[:, 'Patient1_Psi':] == 'NaN').sum(axis=1) < 2]

It gives the expected result:

   chromosome strand  elementloc Patient1_Psi Patient2_Psi Patient3_Psi
1           1      -        8870         0.25         0.45          0.1
3           4      -        9674         0.92            1          0.4

By the way, if you had real NaN values rather than their string representation, you would use:

df[df.loc[:, 'Patient1_Psi':].isna().sum(axis=1) < 2]
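
The counting approach differs from an "any NaN" filter once a row has exactly one missing value. A small sketch of that case follows, using made-up data (not from the question) and also adding the std column the question asks for. The names patients, n_missing and kept are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 0 has two NaN, row 1 has one, row 2 has none.
df = pd.DataFrame({
    'elementloc': [4991, 8870, 2703],
    'Patient1_Psi': [np.nan, 0.25, 0.63],
    'Patient2_Psi': [0.11, np.nan, 0.70],
    'Patient3_Psi': [np.nan, 0.10, 0.40],
})

# Every column from 'Patient1_Psi' to the end is treated as a patient column.
patients = df.loc[:, 'Patient1_Psi':]

# Number of missing patient values in each row.
n_missing = patients.isna().sum(axis=1)

# Keep rows with fewer than 2 NaN; copy so a column can be added safely.
kept = df[n_missing < 2].copy()

# std skips NaN by default, so row 1 uses its two remaining values.
kept['std'] = patients[n_missing < 2].std(axis=1)
print(kept)
```

Row 1 survives here, whereas a filter based on isna().any(axis=1) would have dropped it.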

score 0 · Accepted Answer

You can use dropna with subset, a list of column names. Only the columns in the subset are considered when deciding which rows to drop:

df.columns.difference selects the remaining columns, that is, those not in the list passed to it.

df.replace('NaN', np.nan, inplace=True)
df.dropna(subset=['Patient1_Psi', 'Patient2_Psi','Patient3_Psi'], axis=0, inplace=True)
df["std"] = df[df.columns.difference(['chromosome','strand', 'elementloc'])].std(axis=1)
print(df)
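
By default dropna removes a row if any subset column is NaN. If the intent is to drop only rows with two or more missing patient values, the thresh parameter of dropna may be a closer fit. The sketch below assumes the 'NaN' strings have already been replaced by real NaN; patient_cols and result are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'chromosome': [1, 1, 5, 4],
    'strand': ['-', '-', '+', '-'],
    'elementloc': [4991, 8870, 2703, 9674],
    'Patient1_Psi': [np.nan, 0.25, 0.63, 0.92],
    'Patient2_Psi': [0.11, 0.45, np.nan, 1.0],
    'Patient3_Psi': [np.nan, 0.1, np.nan, 0.4],
})

patient_cols = [c for c in df.columns if c.startswith('Patient')]

# thresh is the minimum number of non-NaN values a row needs to be kept.
# Tolerating at most one NaN means requiring len(patient_cols) - 1 values.
result = df.dropna(subset=patient_cols, thresh=len(patient_cols) - 1)

# Row-wise standard deviation over the patient columns only.
result = result.assign(std=result[patient_cols].std(axis=1))
print(result)
```

On the sample data this keeps the same two rows as the answer above, because every dropped row has two missing values.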
