python - Different standardization results between Patsy and Pandas - Python

Question

I ran into an interesting problem and would love to hear your explanation.

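import pandas as pd
from patsy import dmatrix, demo_data

# Build a small demo DataFrame from Patsy's sample data
df = pd.DataFrame(demo_data("a", "b", "x1", "x2", "y", "z column"))

# Standardize x2 with Patsy
Patsy_Standarlize_Output = dmatrix("standardize(x2) + 0", df).ravel()
# Standardize x2 by hand with pandas
output = (df['x2'] - df['x2'].mean()) / df['x2'].std()
Pandas_Standarlize_Output = output.to_numpy()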

If you print the standardized x2 column from each approach, you will see that the results differ. They are as follows:

Patsy_Standarlize_Output = [-1.21701061, -0.07791372, -0.66884723, 2.23584028, 0.69898536, -0.71843674, -0.00416815, -0.2484492]

Pandas_Standarlize_Output = [-1.13840918, -0.07288161, -0.62564929, 2.09143707, 0.65384094, -0.67203603, -0.00389895, -0.23240294]

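My question is: since I standardized the same column both times, why are the results different?

I look forward to your explanation. Thank you very much for your time and help!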

score 1 · Accepted Answer

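pandas std() applies Bessel's correction (it divides by n - 1 rather than n), while most other libraries, including NumPy and Patsy's standardize() by default, do not. Once you have a few dozen points the difference hardly matters, but for small samples the correction is a very reasonable thing to do.

Proof: if you replace df['x2'].std() with the NumPy version (df['x2'].values.std()), the results match.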
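A minimal sketch of the effect, using only pandas and NumPy. The sample values below are hypothetical and chosen purely for illustration; they are not the demo_data values from the question. It compares the default pandas std() (ddof=1) with the population version (ddof=0) and checks that the two standardized results differ by the constant factor sqrt(n / (n - 1)).

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, for illustration only (not the data from the question).
x = pd.Series([1.0, 2.0, 4.0, 7.0, 11.0, 16.0, 22.0, 29.0])
n = len(x)

centered = x - x.mean()

# pandas default: sample standard deviation, divides by n - 1 (ddof=1).
z_sample = (centered / x.std()).to_numpy()

# Population standard deviation, divides by n (ddof=0).
z_population = (centered / x.std(ddof=0)).to_numpy()

# NumPy's ndarray.std() uses ddof=0 by default, so it matches the population version.
z_numpy = (centered / x.to_numpy().std()).to_numpy()

# The two standardizations differ only by the constant factor sqrt(n / (n - 1)).
ratio = z_population / z_sample

print(np.allclose(z_population, z_numpy))
print(np.allclose(ratio, np.sqrt(n / (n - 1))))
```

With the 8 rows in the question, that factor is sqrt(8 / 7), roughly 1.069, which is the ratio between the two printed outputs (for example -1.21701061 / -1.13840918). To get matching numbers, either pass ddof=0 to the pandas std() call, or, assuming you prefer the corrected estimate, use standardize(x2, ddof=1) in the Patsy formula.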
