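What is the most idiomatic way to normalize each row of a pandas DataFrame? Normalizing the columns is easy, so one (very ugly!) option is:
(df.T / df.T.sum()).T
pandas broadcasting rules prevent df / df.sum(axis=1) from doing this.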
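To make the alignment issue concrete, here is a minimal sketch on a hypothetical 2x3 frame (the labels and values are invented for illustration). It shows plain division matching the row sums against the column labels, and div with axis=0 matching them against the row index instead.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; labels and values are for illustration only.
df = pd.DataFrame(
    [[1.0, 1.0, 2.0], [2.0, 3.0, 5.0]],
    index=["x", "y"],
    columns=["a", "b", "c"],
)

row_sums = df.sum(axis=1)  # Series indexed by the row labels "x" and "y"

# Plain division aligns the Series index ("x", "y") with the column labels
# ("a", "b", "c"). No labels match, so every cell in the result is NaN.
misaligned = df / row_sums

# div with axis=0 aligns the Series with the row index instead.
normalized = df.div(row_sums, axis=0)

print(misaligned)
print(normalized)
```

With these values, each row of normalized sums to 1, while misaligned contains only NaN.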
We can also take the underlying NumPy array, sum along the axis while keeping the dimensions, and divide element-wise:
df / df.to_numpy().sum(axis=1, keepdims=True)
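As a small sketch of why keepdims=True matters here (the frame below is a made-up example): the row sums come back with shape (n, 1), which broadcasts across the columns, and the result keeps the original index and column labels.

```python
import numpy as np
import pandas as pd

# Hypothetical example frame; labels and values are for illustration only.
df = pd.DataFrame(
    [[1.0, 3.0], [2.0, 2.0], [4.0, 12.0]],
    index=["x", "y", "z"],
    columns=["a", "b"],
)

# keepdims=True keeps the summed axis as length 1, giving a column vector.
row_sums = df.to_numpy().sum(axis=1, keepdims=True)
print(row_sums.shape)  # (3, 1)

# The (3, 1) array broadcasts across the columns; labels are preserved.
normalized = df / row_sums
print(normalized)
```

One difference worth keeping in mind: DataFrame.sum skips NaN values by default, whereas NumPy's sum propagates them, so the two approaches can disagree on frames with missing data.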
In the timings below, this approach is roughly 1.65 times as fast as sum along the axis followed by div along the index:
df = pd.DataFrame(np.random.rand(1000000, 100))
%timeit -n 10 df.div(df.sum(axis=1), axis=0)
748 ms ± 14.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 df / df.to_numpy().sum(axis=1, keepdims=True)
452 ms ± 4.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
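Outside IPython, a comparable measurement can be sketched with the standard timeit module. The function names with_div and with_numpy are made up for this example, and the frame is much smaller than the one timed above, so the absolute numbers will differ by machine and size.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller frame than in the timings above, to keep the sketch quick to run.
df = pd.DataFrame(np.random.rand(10_000, 100))


def with_div(frame):
    # sum along axis=1, then divide aligned on the row index
    return frame.div(frame.sum(axis=1), axis=0)


def with_numpy(frame):
    # sum on the underlying array, keeping a (n, 1) shape for broadcasting
    return frame / frame.to_numpy().sum(axis=1, keepdims=True)


# Both approaches should agree before their speed is compared.
assert np.allclose(with_div(df), with_numpy(df))

for func in (with_div, with_numpy):
    seconds = timeit.timeit(lambda: func(df), number=10) / 10
    print(f"{func.__name__}: {seconds * 1e3:.2f} ms per call")
```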
In fact, the trend holds as we increase the number of rows and columns:
Code to reproduce the plots above:
import perfplot
import pandas as pd
import numpy as np
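
def enke(df):
    # Row sums on the underlying NumPy array, kept as a (n, 1) column
    return df / df.to_numpy().sum(axis=1, keepdims=True)

def joris(df):
    # Row sums as a Series, divided aligned on the row index
    return df.div(df.sum(axis=1), axis=0)

# Vary the number of rows at a fixed width of 10 columns
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(n, 10)),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    n_range=[2 ** k for k in range(4, 21)],
    equality_check=np.allclose,
    xlabel='~len(df)',
    title='For len(df)x10 DataFrames'
)

# Vary the number of columns at a fixed length of 10,000 rows
perfplot.show(
    setup=lambda n: pd.DataFrame(np.random.rand(10000, n)),
    kernels=[enke, joris],
    labels=['enke', 'joris'],
    # np.random.rand needs integer dimensions, so truncate and deduplicate
    n_range=sorted({int(1.4 ** k) for k in range(21)}),
    equality_check=np.allclose,
    xlabel='~width(df)',
    title='For 10_000xwidth(df) DataFrames'
)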
I suggest using the scikit-learn preprocessing module and transposing your DataFrame as needed:
'''
Created on 05/11/2015
@author: rafaelcastillo
'''
import matplotlib.pyplot as plt
import pandas
import random
import numpy as np
from sklearn import preprocessing

def create_cos(number_graphs, length, amp):
    # Generate cosine-like curves for testing.
    # number_graphs: number of curves to generate
    # length: number of points along the x axis
    # amp: scales the range of the cosine argument to vary the curve shape
    x = np.arange(length)
    amp = np.pi * amp
    xx = np.linspace(np.pi * 0.3 * amp, -np.pi * 0.3 * amp, length)
    for i in range(number_graphs):
        # Cosine curve with a little uniform noise added to each point
        iterable = (2 * np.cos(v) + random.random() * 0.1 for v in xx)
        y = np.fromiter(iterable, float)  # np.float was removed in NumPy 1.24
        if i == 0:
            yfinal = y
            continue
        yfinal = np.vstack((yfinal, y))
    return x, yfinal

x, y = create_cos(70, 24, 3)
data = pandas.DataFrame(y)
x_values = data.columns.values
num_rows = data.shape[0]

# Plot every row of the raw data as one curve
fig, ax = plt.subplots()
for i in range(num_rows):
    ax.plot(x_values, data.iloc[i])
ax.set_title('Raw data')
plt.show()

# MinMaxScaler works column-wise, so transpose to scale each original row
std_scale = preprocessing.MinMaxScaler().fit(data.transpose())
df_std = std_scale.transform(data.transpose())
data = pandas.DataFrame(np.transpose(df_std))

# Plot every row again after scaling
fig, ax = plt.subplots()
for i in range(num_rows):
    ax.plot(x_values, data.iloc[i])
ax.set_title('Data Normalized')
plt.show()
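Note that MinMaxScaler with its default settings rescales each row to the range [0, 1] rather than dividing by the row sum, so this is a different kind of normalization from the one in the question. As a rough sketch, the same row-wise min-max scaling can be written in plain pandas without transposing; the random frame below is a stand-in for the cosine data.

```python
import numpy as np
import pandas as pd

# Stand-in data: 5 rows of 24 random points (hypothetical example).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((5, 24)) * 10)

row_min = data.min(axis=1)
row_range = data.max(axis=1) - row_min

# Subtract each row's minimum, then divide by each row's range,
# aligning both Series on the row index with axis=0.
scaled = data.sub(row_min, axis=0).div(row_range, axis=0)

print(scaled.min(axis=1).tolist())
print(scaled.max(axis=1).tolist())
```

A row whose values are all identical has a range of zero and would produce NaN here, so such rows need separate handling.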