42

What is the most idiomatic way to normalize each row of a pandas DataFrame? Normalizing the columns is easy, so one (very ugly!) option is:

(df.T / df.T.sum()).T

The pandas broadcasting rules prevent df / df.sum(axis=1) from doing this.
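A minimal sketch of why the plain division fails, using a made-up two-row frame (the data and labels are hypothetical, not from the question). With the / operator, pandas matches the index of the Series against the column labels of the DataFrame rather than against its rows:

```python
import pandas as pd

# Hypothetical toy data: the row labels differ from the column labels.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]}, index=["x", "y"])

row_sums = df.sum(axis=1)  # Series indexed by the row labels "x" and "y"

# The / operator aligns the Series index with the columns "a" and "b".
# No labels match, so every entry of the result is NaN.
broken = df / row_sums
print(broken.isna().all().all())

# The transpose trick from the question works, but reads awkwardly.
ugly = (df.T / df.T.sum()).T
print(ugly.sum(axis=1).tolist())
```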

3 Answers

96

To overcome the broadcasting issue, you can use the div method:

df.div(df.sum(axis=1), axis=0)

See the pandas user guide: Matching / broadcasting behavior
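As a small worked example of the div call (the toy frame is hypothetical), axis=0 asks pandas to align the Series of row sums with the row index, so each row is divided by its own total:

```python
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]}, index=["x", "y"])

# axis=0 aligns the row sums with the row labels instead of the columns.
normalized = df.div(df.sum(axis=1), axis=0)
print(normalized)

# Each row of the result sums to 1.
print(normalized.sum(axis=1))
```

One thing to watch for: a row whose sum is zero is divided by zero, so it comes out as NaN or inf rather than raising an error.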

answered 2013-09-03T14:15:46.687
1

We can also get the underlying NumPy array, sum along the axis while keeping the dimensions, and divide element-wise:

df / df.to_numpy().sum(axis=1, keepdims=True)
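To illustrate what keepdims=True contributes here, the following sketch (with a hypothetical toy frame) prints the shape of the row sums and checks the result against the div approach. Because a plain ndarray carries no labels, NumPy-style broadcasting applies instead of label alignment:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration.
df = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 6.0]}, index=["x", "y"])

# keepdims=True keeps the summed axis with length 1: shape (2, 1), not (2,).
row_sums = df.to_numpy().sum(axis=1, keepdims=True)
print(row_sums.shape)

# The (2, 1) column of sums is broadcast across both columns of df.
normalized = df / row_sums

# Compare with the label-based div approach.
print(np.allclose(normalized, df.div(df.sum(axis=1), axis=0)))
```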

This method is about 60% faster than summing along the axis and then calling div along the index:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(1000000, 100))

%timeit -n 10 df.div(df.sum(axis=1), axis=0)
748 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit -n 10 df / df.to_numpy().sum(axis=1, keepdims=True)
452 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
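%timeit is an IPython magic. Outside IPython, a rough equivalent can be sketched with the standard timeit module. The frame size below is an assumption chosen so the script finishes quickly; it is much smaller than the one timed above, so the numbers will differ:

```python
import timeit

import numpy as np
import pandas as pd

# Assumption: a smaller frame than the 1,000,000 x 100 one timed above.
df = pd.DataFrame(np.random.rand(10_000, 100))

def with_div():
    return df.div(df.sum(axis=1), axis=0)

def with_numpy():
    return df / df.to_numpy().sum(axis=1, keepdims=True)

# Best of 5 repeats of 10 calls each, reported per call in milliseconds.
for func in (with_div, with_numpy):
    best = min(timeit.repeat(func, number=10, repeat=5)) / 10
    print(f"{func.__name__}: {best * 1e3:.2f} ms per call")
```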

In fact, this trend persists as we increase the number of rows and columns:

[Figure: perfplot runtime comparison of the two approaches as the number of rows and columns grows]


Code to reproduce the plots above:

import perfplot
import pandas as pd
import numpy as np

def enke(df):
    return df / df.to_numpy().sum(axis=1, keepdims=True)

def joris(df):
    return df.div(df.sum(axis=1), axis=0)

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(n, 10)), 
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=np.allclose,  
    xlabel='~len(df)',
    title='For len(df)x10 DataFrames'
)

perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(10000, n)), 
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=sorted({int(1.4 ** k) for k in range(21)}),  # widths must be integers; duplicates dropped
    equality_check=np.allclose,  
    xlabel='~width(df)',
    title='For 10_000xwidth(df) DataFrames'
)
answered 2022-02-26T11:22:26.823
-3

I suggest using the scikit-learn preprocessing module and transposing your DataFrame as needed:

'''
Created on 05/11/2015

@author: rafaelcastillo
'''

import matplotlib.pyplot as plt
import pandas
import random
import numpy as np
from sklearn import preprocessing

def create_cos(number_graphs,length,amp):
    # This function is used to generate cos-kind graphs for testing
    # number_graphs: to plot
    # length: number of points included in the x axis
    # amp: Y domain modifications to draw different shapes
    x = np.arange(length)
    amp = np.pi*amp
    xx = np.linspace(np.pi*0.3*amp, -np.pi*0.3*amp, length)
    for i in range(number_graphs):
        iterable = (2*np.cos(x) + random.random()*0.1 for x in xx)
        y = np.fromiter(iterable, float)  # the np.float alias was removed in NumPy 1.24
        if i == 0: 
            yfinal =  y
            continue
        yfinal = np.vstack((yfinal,y))
    return x,yfinal

x,y = create_cos(70,24,3)
data = pandas.DataFrame(y)

x_values = data.columns.values
num_rows = data.shape[0]

fig, ax = plt.subplots()
for i in range(num_rows):
    ax.plot(x_values, data.iloc[i])
ax.set_title('Raw data')
plt.show() 

std_scale = preprocessing.MinMaxScaler().fit(data.transpose())
df_std = std_scale.transform(data.transpose())
data = pandas.DataFrame(np.transpose(df_std))


fig, ax = plt.subplots()
for i in range(num_rows):
    ax.plot(x_values, data.iloc[i])
ax.set_title('Data Normalized')
plt.show()                                   
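MinMaxScaler scales each column, which is why the code above transposes before and after the transform. Note that this rescales every row to the range [0, 1] rather than dividing by the row sum. Assuming the default feature range, the same row-wise min-max scaling can be sketched directly in pandas without transposing (the toy data below is hypothetical and stands in for the cosine curves):

```python
import pandas as pd

# Hypothetical toy data standing in for the cosine curves above.
data = pd.DataFrame([[1.0, 2.0, 3.0], [10.0, 30.0, 20.0]])

row_min = data.min(axis=1)
row_range = data.max(axis=1) - row_min

# Subtract each row's minimum, then divide by each row's range.
# axis=0 aligns both Series with the row index.
scaled = data.sub(row_min, axis=0).div(row_range, axis=0)
print(scaled)
```

In this sketch a constant row has a range of zero and is therefore divided by zero, so such rows would need separate handling.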
answered 2015-11-05T12:52:39.527