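I'm running some Python statsmodels code in a Docker container. When I run this code on two different computers (using the same Docker image pulled from DockerHub, not built locally twice), I get different results. The differences are tiny: only the 10th or 15th significant digit changes. But it is breaking our reproducible builds. Is this a Python statsmodels issue? A Docker issue?
I think it's Python, because 1000+ other lines of code run in containers spawned from these Docker images, and their output is reproducible.
Here is an MWE, along with an example of the differences: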
import numpy as np
import pandas as pd
import statsmodels
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std
np.random.seed(42)
df = pd.DataFrame(columns=['foo', 'bar'], data=np.random.random((1000, 2)))
y = (df['bar'])
X = np.log10(df['foo'])
X = sm.add_constant(X)
model = sm.OLS(y, X)
fits = model.fit()
predictions = fits.predict(X)
XX = np.linspace(X['foo'].min(), X['foo'].max(), 50)
XX = sm.add_constant(XX)
yy = fits.predict(XX)
sdev, lower, upper = wls_prediction_std(fits, exog=XX, alpha=0.05)
bad = df.loc[df['bar'] < 50,'bar']
df.loc[df['bar'] < 50,'bar'] = fits.predict(sm.add_constant(np.log10(bad)))
fits.summary()
with open("output.txt", "w") as text_file:
    text_file.write(fits.summary().as_csv())
df.to_csv('out.csv', index=False)
The differences in out.csv are small. For example,
$ sdiff <(cat out.csv) <(ssh remote_server cat out.csv) | tail
shows the following. Note that only the last digit changes.
0.18610141784627732,0.5081884090422659 | 0.18610141784627732,0.5081884090422658
0.45818688673789265,0.5082792408801786 | 0.45818688673789265,0.5082792408801785
0.13347997241594378,0.5085994020210153 | 0.13347997241594378,0.5085994020210152
0.7279393069737652,0.5082743139146337 | 0.7279393069737652,0.5082743139146336
0.43685070261517955,0.5082054932289445 | 0.43685070261517955,0.5082054932289444
0.7655128989911097,0.5084780190581778 | 0.7655128989911097,0.5084780190581777
0.6102251494776413,0.5085067071667805 | 0.6102251494776413,0.5085067071667804
0.7513750860290457,0.5082242252400639 0.7513750860290457,0.5082242252400639
0.956614621083458,0.5086273010565618 0.956614621083458,0.5086273010565618
0.05705472115125432,0.5083753342014574 | 0.05705472115125432,0.5083753342014573
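For scale, here is a minimal sketch (not part of the script above) that measures one of these mismatches in units of float64 spacing. The two values are copied from the first row of the sdiff output; the names `ulp_distance`, `local_value`, and `remote_value` are hypothetical and chosen only for illustration.

```python
import numpy as np

# Values copied from the first row of the sdiff output above.
local_value = 0.5081884090422659
remote_value = 0.5081884090422658


def ulp_distance(a, b):
    """Return |a - b| measured in float64 spacings at the smaller magnitude."""
    smaller = min(abs(a), abs(b))
    return abs(a - b) / np.spacing(smaller)


ulps = ulp_distance(local_value, remote_value)
relative = abs(local_value - remote_value) / abs(local_value)
print(f"ULP distance: {ulps}, relative difference: {relative:.3e}")
```

A distance of about one spacing would mean the two machines produced neighbouring float64 values, which is the smallest difference two distinct doubles can have.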