
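What is an efficient equivalent of R's scale function in pandas? For example, how would

newdf <- scale(df)

be written in pandas? Is there an elegant way to do it using transform?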


2 Answers


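Scaling is very common in machine learning tasks, so it is implemented in scikit-learn's preprocessing module. You can pass a pandas DataFrame to its scale function.

The only "catch" is that the returned object is no longer a DataFrame but a NumPy array. That is usually not a real problem if you just want to pass the result to a machine learning model (e.g. an SVM or logistic regression). If you want to keep the DataFrame, a small workaround is needed: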

from sklearn.preprocessing import scale
from pandas import DataFrame

newdf = DataFrame(scale(df), index=df.index, columns=df.columns)
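
One caveat when comparing against R: scikit-learn's scale divides by the population standard deviation (ddof=0), whereas R's scale divides by the sample standard deviation (n - 1 in the denominator), so the two results differ by a constant factor. A minimal pandas-only sketch of both conventions, assuming a small made-up DataFrame (the values and variable names here are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

# Population standard deviation (ddof=0): the convention scikit-learn's scale uses.
pop_scaled = (df - df.mean()) / df.std(ddof=0)

# Sample standard deviation (ddof=1): the convention R's scale() uses.
sample_scaled = (df - df.mean()) / df.std(ddof=1)

# The two results differ only by the factor sqrt(n / (n - 1)).
n = len(df)
ratio = np.sqrt(n / (n - 1))
print(np.allclose(pop_scaled, sample_scaled * ratio))
```

For large n the factor is close to 1, so the difference mostly matters for small data sets or when results must match R exactly.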

See also here.

answered 2013-08-02T12:38:05.223

I don't know R, but from reading the documentation it looks like the following would do the trick (albeit in a slightly less general way):

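import numpy as np

def scale(y, c=True, sc=True):
    # Work on a copy so the caller's DataFrame is left untouched.
    x = y.copy()

    if c:
        # Center each column on its mean (NaN values are skipped).
        x -= x.mean()
    if sc and c:
        # After centering, the RMS equals the sample standard deviation.
        x /= x.std()
    elif sc:
        # Without centering, divide by the RMS: sqrt(sum(x^2) / (n - 1)).
        x /= np.sqrt(x.pow(2).sum().div(x.count() - 1))
    return x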

For the more general version you'd probably need to do some type/length checking.
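
As a rough sketch of what such a more general version might look like, here is a variant in which center and scale may each be either a boolean or one number per column, which R's scale also allows. The name scale_general, the exact checks, and the sample data are hypothetical and only meant to illustrate the idea:

```python
import numpy as np
import pandas as pd

def scale_general(df, center=True, scale=True):
    """Hypothetical generalization: center/scale are a bool or one value per column."""
    x = df.astype(float)  # float copy, so integer input can hold fractions
    ncol = x.shape[1]

    if isinstance(center, (bool, np.bool_)):
        if center:
            x = x - x.mean()
    else:
        center = np.asarray(center, dtype=float)
        if center.shape != (ncol,):
            raise ValueError("center must have one value per column")
        x = x - center  # broadcasts across columns

    if isinstance(scale, (bool, np.bool_)):
        if scale:
            # RMS of the (possibly centered) columns, n - 1 in the denominator.
            x = x / np.sqrt(x.pow(2).sum() / (x.count() - 1))
    else:
        scale = np.asarray(scale, dtype=float)
        if scale.shape != (ncol,):
            raise ValueError("scale must have one value per column")
        x = x / scale
    return x

# Illustrative usage with made-up data.
demo = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 9]})
print(scale_general(demo))
print(scale_general(demo, center=[1, 2], scale=[2, 2]))
```

The length checks only cover the one-value-per-column case; non-numeric columns and other input types are deliberately not handled here.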

EDIT: Added an explanation of the denominator in the elif sc: clause.

From the R docs:

 ... If ‘scale’ is
 ‘TRUE’ then scaling is done by dividing the (centered) columns of
 ‘x’ by their standard deviations if ‘center’ is ‘TRUE’, and the
 root mean square otherwise.  If ‘scale’ is ‘FALSE’, no scaling is
 done.

 The root-mean-square for a (possibly centered) column is defined
 as sqrt(sum(x^2)/(n-1)), where x is a vector of the non-missing
 values and n is the number of non-missing values.  In the case
 ‘center = TRUE’, this is the same as the standard deviation, but
 in general it is not.

The line np.sqrt(x.pow(2).sum().div(x.count() - 1)) computes the root mean square from that definition: it squares x (the pow method), sums each column over its rows, divides by the number of non-NaN values in that column minus one (the count method), and finally takes the square root.
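
To make the relationship in the quoted docs concrete, here is a small check that the RMS of centered columns matches std, while the RMS of uncentered columns generally does not. The helper name rms and the sample data (including the NaN) are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value to exercise the non-NaN counting.
df = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0], "b": [3.0, 5.0, 7.0, 11.0]})

def rms(x):
    # sqrt(sum(x^2) / (n - 1)) per column, with n the count of non-NaN values.
    return np.sqrt(x.pow(2).sum().div(x.count() - 1))

centered = df - df.mean()

# With centering, the RMS equals the sample standard deviation.
print(np.allclose(rms(centered), df.std()))

# Without centering, it is a different quantity in general.
print(np.allclose(rms(df), df.std()))
```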

As a side note, the reason I didn't simply compute the RMS after centering is that the std method calls bottleneck for faster computation in the special case where you want the standard deviation rather than the more general RMS.

You could instead compute the RMS after centering. It might be worth a benchmark, since now that I'm writing this I'm not actually sure which is faster.
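
A minimal sketch of how such a benchmark could be set up with timeit. The data shape, the repeat count, and the function names via_std and via_rms are arbitrary assumptions, and the timings will depend on the pandas version and on whether bottleneck is installed:

```python
import timeit

import numpy as np
import pandas as pd

# Arbitrary random data for timing purposes.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((10_000, 20)))
centered = df - df.mean()

def via_std():
    # Scale the centered data with the built-in sample standard deviation.
    return centered / centered.std()

def via_rms():
    # Scale the centered data with the explicit RMS expression.
    return centered / np.sqrt(centered.pow(2).sum().div(centered.count() - 1))

t_std = timeit.timeit(via_std, number=20)
t_rms = timeit.timeit(via_rms, number=20)
print(f"std: {t_std:.4f}s  rms: {t_rms:.4f}s")
```

Both functions return the same values up to floating-point error, so only the timing differs.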

answered 2013-08-01T22:31:10.240