
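I have two sets of data, where X is the observed values and Y is the expected values. I am trying to quantify the goodness of fit with Python. People often compute a statistic for each dataset and use those values to decide which one is better and which one is worse. I want a measure that helps me determine which dataset has observed values closest to its expected values. I tried a chi-square test in Python, but is there any other test that would help determine which dataset fits best?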

Code

from scipy.stats import chisquare
import numpy as np

x1 = np.array([97.83, 95.06, 92.54, 97.69, 93.76, 93.36, 93.37, 99.29, 101.57, 
        97.88, 98.71, 75.31, 72.52, 67.75, 77.97, 78.42, 72.62, 82.29, 90.26, 76.32, 78.78, 79.96])
y1 = np.array([90.90, 90.50, 89.50, 92.90, 91.20, 91.70, 91.40, 94.20, 96.80,
        93.30, 94.40, 70.20, 71.20, 68.40, 74.20, 74.60, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])


x2 = ([92.14, 91.44, 91.31, 93.26, 93.26, 91.65, 92.41, 93.47, 97.12, 101.46, 
        94.99, 98.08, 69.33, 69.63, 68.45, 72.62, 71.17, 80.54, 90.42, 74.25, 79.60, 80.77])
y2 = ([90.90, 90.50, 89.50, 92.90, 93.00, 91.20, 91.70, 91.40, 94.20, 96.80, 93.30, 
        94.40, 70.20, 71.20, 68.40, 74.20, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])

# chisquare(f_obs, f_exp): observed values first, expected values second
print(chisquare(x1, y1))
print(chisquare(x2, y2))

Update

from scipy.stats import chisquare
from sklearn.metrics import r2_score
from scipy import stats
import numpy as np

x1 = np.array([97.83, 95.06, 92.54, 97.69, 93.76, 93.36, 93.37, 99.29, 101.57, 
        97.88, 98.71, 75.31, 72.52, 67.75, 77.97, 78.42, 72.62, 82.29, 90.26, 76.32, 78.78, 79.96])
y1 = np.array([90.90, 90.50, 89.50, 92.90, 91.20, 91.70, 91.40, 94.20, 96.80,
        93.30, 94.40, 70.20, 71.20, 68.40, 74.20, 74.60, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])


x2 = ([92.14, 91.44, 91.31, 93.26, 93.26, 91.65, 92.41, 93.47, 97.12, 101.46, 
        94.99, 98.08, 69.33, 69.63, 68.45, 72.62, 71.17, 80.54, 90.42, 74.25, 79.60, 80.77])
y2 = ([90.90, 90.50, 89.50, 92.90, 93.00, 91.20, 91.70, 91.40, 94.20, 96.80, 93.30, 
        94.40, 70.20, 71.20, 68.40, 74.20, 72.00, 77.80, 83.00, 73.50, 76.70, 82.60])


# r2_score(y_true, y_pred): here the expected values are passed as y_true
print("Scikit R2, 1:", r2_score(y1, x1))
print("Scikit R2, 2:", r2_score(y2, x2))


slope1, intercept1, r_value1, p_value1, std_err1 = stats.linregress(y1,x1)
slope2, intercept2, r_value2, p_value2, std_err2 = stats.linregress(y2,x2)


# r_value is a correlation coefficient; squaring it gives R^2 of the fitted line
print("Stats R2, 1:", r_value1**2)
print("Stats R2, 2", r_value2**2)

With the updated code, I get the following output:

Scikit R2, 1: 0.820091025592
Scikit R2, 2: 0.928643087517
Stats R2, 1: 0.958813342741
Stats R2, 2 0.965013525387

Why are the R2 values obtained from scikit-learn and SciPy different?


1 Answer


The two functions you listed (scipy.stats.linregress and sklearn.metrics.r2_score) do different things.

sklearn.metrics.r2_score

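sklearn.metrics.r2_score does what you are looking for: it takes two sets of data and computes the R^2 (coefficient of determination) between them. From the documentation: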

sklearn.metrics.r2_score(y_true, y_pred, sample_weight=None, multioutput=None)

Parameters:

y_true : array-like of shape = (n_samples) or (n_samples, n_outputs)

Ground truth (correct) target values.

y_pred : array-like of shape = (n_samples) or (n_samples, n_outputs)

Estimated target values.

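So your observed data (x1, x2) are your y_true, and your expected values (y1, y2) are your y_pred. Argument order matters, because the total sum of squares is computed around the mean of y_true. The correct way to call it is therefore: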

r2_score(x1, y1)
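
To make the role of the argument order concrete, here is a minimal NumPy sketch of the coefficient of determination, R^2 = 1 - SS_res / SS_tot, which is the quantity r2_score is documented to compute. The helper name r2_manual and the small observed/expected arrays are made up for illustration; they are not part of scikit-learn or of the data in the question.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Residual sum of squares: distance between the two series as given.
    ss_res = np.sum((y_true - y_pred) ** 2)
    # Total sum of squares: spread of y_true around its own mean.
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Made-up illustrative values, not the data from the question.
observed = [3.0, 5.0, 7.0, 9.0]
expected = [2.5, 5.5, 7.0, 9.5]

print("observed as y_true:", r2_manual(observed, expected))
print("expected as y_true:", r2_manual(expected, observed))
```

The two printed values differ because only the first argument enters SS_tot, which is why swapping the arguments changes the score.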

scipy.stats.linregress

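scipy.stats.linregress does not do what you are looking for. Its purpose is to perform a linear regression and find a fit between two sets of data (not between one set of data and its predicted values). The r_value it returns (which you can square to get R^2) is the correlation coefficient between the y values you gave it and the values predicted by the regression (fit) it performed. Since you already know your predicted values, this is not the function you are looking for.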
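
The difference can be seen on a deliberately extreme, made-up example. The sketch below uses np.corrcoef as a stand-in for the r_value that linregress reports (both are the Pearson correlation coefficient) and computes the coefficient of determination directly from the given predictions. The arrays truth and pred are hypothetical and chosen so that the predictions are perfectly correlated with the true values yet far away from them.

```python
import numpy as np

# Made-up data: pred is an exact linear function of truth, but badly offset.
truth = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
pred = 2.0 * truth + 10.0

# Squared Pearson correlation: what r_value**2 from a fitted line measures.
r = np.corrcoef(truth, pred)[0, 1]
r_squared = r ** 2

# Coefficient of determination of pred as-is, with no refitting.
ss_res = np.sum((truth - pred) ** 2)
ss_tot = np.sum((truth - truth.mean()) ** 2)
r2_direct = 1.0 - ss_res / ss_tot

print("squared correlation:", r_squared)
print("direct R^2:", r2_direct)
```

The squared correlation is essentially 1 because a straight line maps truth onto pred exactly, while the direct R^2 is strongly negative because the predictions themselves are far from the true values. The regression-based number always evaluates the best-fitting line, not the predictions you supplied, which would explain why it comes out higher in the question's output.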

Answered 2016-02-11T11:32:20.250