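What is an efficient equivalent of R's scale function in pandas? For example, how would
newdf <- scale(df)
be written in pandas? Is there an elegant way to do it using transform?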
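Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's preprocessing module. You can pass a pandas DataFrame to its scale function.
The only "problem" is that the returned object is no longer a DataFrame but a NumPy array, which is usually not a real issue if you want to pass it to a machine learning model (e.g. an SVM or logistic regression). If you want to keep the DataFrame, a small workaround is needed: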
from sklearn.preprocessing import scale
from pandas import DataFrame
newdf = DataFrame(scale(df), index=df.index, columns=df.columns)
See also here.
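One caveat when comparing against R: scikit-learn's scale divides by the population standard deviation (ddof=0), whereas R's scale divides by the sample standard deviation (n - 1 in the denominator), which is also the pandas default for std. The following is a minimal pandas-only sketch of both conventions, including a transform-based variant as asked in the question. The example data and the names r_style, sk_style and via_transform are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [10.0, 20.0, 40.0, 80.0]})

# R-style scaling: center, then divide by the sample std (ddof=1, pandas default)
r_style = (df - df.mean()) / df.std()

# scikit-learn-style scaling: divide by the population std (ddof=0)
sk_style = (df - df.mean()) / df.std(ddof=0)

# The same R-style result, written column by column with transform
via_transform = df.transform(lambda col: (col - col.mean()) / col.std())

# The two conventions differ only by a constant factor sqrt((n - 1) / n)
n = len(df)
factor = np.sqrt((n - 1) / n)
print(np.allclose(r_style, sk_style * factor))
```

For large n the factor is close to 1, so the difference rarely matters for model fitting, but it does matter if you need to reproduce R output exactly.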
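I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way):
import numpy as np

def scale(y, c=True, sc=True):
    # Work on a copy so the caller's DataFrame is left untouched
    x = y.copy()
    if c:
        # Center each column on its mean
        x -= x.mean()
    if sc and c:
        # Centered: divide by the sample standard deviation
        x /= x.std()
    elif sc:
        # Not centered: divide by the root mean square, as R does
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
    return x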
For the more general version you'd probably need to do some type/length checking.
EDIT: Added an explanation of the denominator in the elif sc: clause.
From the R docs:
... If ‘scale’ is
‘TRUE’ then scaling is done by dividing the (centered) columns of
‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
root mean square otherwise. If ‘scale’ is ‘FALSE’, no scaling is
done.
The root-mean-square for a (possibly centered) column is defined
as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
values and n is the number of non-missing values. In the case
‘center = TRUE’, this is the same as the standard deviation, but
in general it is not.
The line np.sqrt(x.pow(2).sum().div(x.count() - 1)) computes the root mean square from that definition: it first squares x (the pow method), then sums down each column (sum skips NaN values by default), and then divides by the number of non-NaN values in each column minus one (the count method) before taking the square root.
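To make the R definition concrete, here is a small sketch that applies the same expression to a frame with a missing value and checks the claim from the R docs that, after centering, the root mean square coincides with the standard deviation. The helper name rms and the example data are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data with one missing value in column "a"
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                   "b": [3.0, 5.0, 7.0, 9.0]})

def rms(frame):
    # sqrt(sum(x^2) / (n - 1)) per column, where n counts non-missing values
    return np.sqrt(frame.pow(2).sum().div(frame.count() - 1))

uncentered = rms(df)             # what R divides by when center = FALSE
centered = rms(df - df.mean())   # what R divides by when center = TRUE

# After centering, the RMS equals the sample standard deviation
print(np.allclose(centered, df.std()))
```

Without centering the two quantities differ, which is why the function above needs a separate branch for that case.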
As a side note, the reason I didn't simply compute the RMS after centering is that the std method calls bottleneck for faster computation in the special case where you want the standard deviation rather than the more general RMS.
You could instead compute the RMS after centering. It might be worth a benchmark, since I'm not actually sure which is faster and I haven't measured it.
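A rough way to run that benchmark is sketched below using timeit from the standard library. The data size, the repeat count and the function names via_std and via_rms are arbitrary choices for illustration, and the outcome will depend on the pandas version and on whether bottleneck is installed, so no timings are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark data: 10,000 rows by 20 columns, already centered
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(10_000, 20)))
centered = df - df.mean()

def via_std():
    # Built-in sample standard deviation
    return centered.std()

def via_rms():
    # Generic root mean square applied to the centered data
    return np.sqrt(centered.pow(2).sum().div(centered.count() - 1))

# Total time in seconds for 20 calls of each variant
t_std = timeit.timeit(via_std, number=20)
t_rms = timeit.timeit(via_rms, number=20)
print(f"std: {t_std:.4f}s  rms: {t_rms:.4f}s")
```

Both variants return the same values on centered data, so the choice between them only affects speed.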