python - Split a DataFrame into correspondingly named arrays or Series (then recombine)

Question

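Suppose I have a DataFrame with columns x and y. I want to automatically split it into arrays (or Series) that have the same names as the columns, process the data, and then rejoin them. Doing this manually is straightforward: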

x, y = df.x, df.y
z = x + y   # in actual use case, there are hundreds of lines like this
df = pd.concat([x,y,z],axis=1)

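But I want to automate this. It's easy to get a list of strings with df.columns, but I really want [x, y] rather than ['x', 'y']. So far the best I've been able to do is to work around it with exec: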

import numpy as np
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000),  'z':np.zeros(1000) })

def method1( df ):

   for col in df.columns:
      exec( col + ' = df.' + col + '.values')

   z = x + y   # in actual use case, there are hundreds of lines like this

   for col in df.columns:   
      exec( 'df.' + col + '=' + col )

df = df_orig.copy() 
method1( df )         # df appears to be view of global df, no need to return it
df1 = df

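So there are 2 questions:

1) Using exec like this is generally not a good idea (it has already caused me problems when I tried to combine it with numba), or is it really that bad? It seems to work for both Series and arrays.

2) I'm not sure of the best way to take advantage of views here. Ideally, all I really want to do is use x as a view of df.x. I assume that isn't possible when x is an array, but maybe it is if x is a Series?

The example above is for arrays, but ideally I'm looking for a solution that also works for Series. Failing that, solutions that work with one or the other are of course welcome.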

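Motivation:

1) Readability, which can be partially achieved with eval, but I don't believe eval can be used over multiple lines?

2) With many lines like z = x + y, this approach is somewhat faster for Series (2x or 3x in the examples I've tried) and even faster for arrays (over 10x). See here: Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba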

score 1 · Accepted Answer

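This doesn't do exactly what you asked for, but it's another way to think about the problem.

Here is a gist that defines a context manager that lets you refer to columns as if they were local variables. I didn't write it, and it's a bit old, but it still seems to work with the current version of pandas.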

In [45]: df = pd.DataFrame({'x': np.random.randn(100000), 'y': np.random.randn(100000)})

In [46]: with DataFrameContextManager(df):
    ...:     z = x + y
    ...:     

In [47]: z.head()
Out[47]: 
0   -0.821079
1    0.035018
2    1.180576
3   -0.155916
4   -2.253515
dtype: float64
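
Since the gist itself isn't reproduced here, the following is only a rough sketch of how such a context manager could work, not the gist's actual code. The class name ColumnNamespace and its namespace argument are hypothetical: it copies each column into a dict you pass in (for example globals()) on entry, and on exit removes the names it added and restores any it overwrote.

```python
import numpy as np
import pandas as pd


class ColumnNamespace:
    """Hypothetical stand-in for the gist's context manager (an assumption)."""

    def __init__(self, df, namespace):
        self.df = df
        self.namespace = namespace  # e.g. globals(), or any plain dict
        self._saved = {}            # names we overwrote, with their old values
        self._added = []            # names that did not exist before

    def __enter__(self):
        for col in self.df.columns:
            if col in self.namespace:
                self._saved[col] = self.namespace[col]
            else:
                self._added.append(col)
            # Expose the column as a Series under its own name.
            self.namespace[col] = self.df[col]
        return self

    def __exit__(self, exc_type, exc, tb):
        # Remove the injected names and put back anything we shadowed.
        for col in self._added:
            self.namespace.pop(col, None)
        self.namespace.update(self._saved)
        return False  # do not swallow exceptions


df = pd.DataFrame({'x': np.arange(5.0), 'y': np.arange(5.0, 10.0)})

with ColumnNamespace(df, globals()):
    z = x + y  # x and y are looked up as globals inside the block

df['z'] = z  # results still have to be written back explicitly
```

Note that this only handles reading columns by name. Anything computed inside the block, like z above, is an ordinary variable and has to be assigned back to the DataFrame by hand.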

score 1 · Accepted Answer

Just use indexing notation and a dictionary instead of attribute notation.

import numpy as np
from pandas import DataFrame

df_orig = DataFrame({ 'x':range(1000), 'y':range(1000,2000),  'z':np.zeros(1000) })

def method1( df ):

   series = {}
   for col in df.columns:
      series[col] = df[col]

   series['z'] = series['x'] + series['y']   # in actual use case, there are hundreds of lines like this

   for col in df.columns:   
      df[col] = series[col]

df = df_orig.copy() 
method1( df )         # df appears to be view of global df, no need to return it
df1 = df
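
If arrays are wanted rather than Series, the same dictionary idea might be written more compactly as below. This is a sketch under my own assumptions: the function name split_process_join is made up for illustration, and it returns a new DataFrame instead of modifying the one passed in.

```python
import numpy as np
import pandas as pd


def split_process_join(df):
    # Split: one NumPy array per column, keyed by the column name.
    cols = {name: df[name].to_numpy() for name in df.columns}

    # Process: stands in for the hundreds of lines in the real use case.
    cols['z'] = cols['x'] + cols['y']

    # Join: rebuild a DataFrame from the arrays, reusing the original index.
    return pd.DataFrame(cols, index=df.index)


df_small = pd.DataFrame({'x': range(5), 'y': range(5, 10), 'z': np.zeros(5)})
result = split_process_join(df_small)
```

Returning a new DataFrame avoids relying on whether the extracted arrays are views of the original data or copies.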
