Please help me: I have been trying to work out a chi-squared test in SQL Server 2008 R2 Developer Edition. The query itself runs fine against the following sample data:

sessionnumber   sessioncount    timespent          cnt
    1                  17               28          45
    2                  22               8           30
    3                  1                1           2
    4                  1                1           2
    5                  8               111          119
    6                  8                65          73
    7                  11               5           16
    8                  1                1           2
    9                  62               64          126
   10                  6                42          48

The query I have been trying is:

SELECT sessionnumber, sessioncount, timespent, expected, dev,
dev*dev/cast(expected as float) as chi_square

FROM (SELECT d3.sessionnumber, d3.sessioncount, d3.timespent,
(dim1.cnt * dim2.cnt * dim3.cnt)/cast((dimall.cnt*dimall.cnt)as float) as expected,
d3.cnt-(dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as dev FROM d3 JOIN

(SELECT sessionnumber, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY sessionnumber) dim1
ON d3.sessionnumber = dim1.sessionnumber JOIN

(SELECT sessioncount, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY sessioncount) dim2
ON d3.sessioncount = dim2.sessioncount JOIN

(SELECT timespent, SUM(cast(cnt as float)) as cnt FROM d3
GROUP BY timespent) dim3
ON d3.timespent = dim3.timespent CROSS JOIN

(SELECT SUM(cast(cnt as float)) as cnt FROM d3) dimall) a

The results this query produces are wrong. They are:

sessionnumber   sessioncount    timespent          expected                              dev            chi_square
    1                  17               28          2.37921034130308E-09        44.9999999976208    851122729517.387
    2                  22               8           1.72099699796333E-10        29.9999999998279    5229526844351.02
    3                  1                1           1.3008335197251E-11         1.99999999998699    307495151323.689
    4                  1                1           1.3008335197251E-11         1.99999999998699    307495151323.689
    5                  8               111          1.90995107994937E-07        118.999999809005    74143260019.6156
    6                  8                65          5.09110109296227E-09        72.9999999949089    1046728379961.52
    7                  11               5           5.36406353430159E-11        15.9999999999464    4772501264409.71
    8                  1                1           1.3008335197251E-11         1.99999999998699    307495151323.689
    9                  62               64          6.56781317803123E-09        125.999999993432    2417242934291.85
   10                  6                42          1.41737398829092E-09        47.9999999985826    1625541331291.19

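The correct chi-square value for sessionnumber 1 and sessionnumber 2 should be 9.117, but my query gives a wrong result. (That figure is only an example computed from the first two sessionnumber rows, but it is the correct value.) I have been working hard on this for the past 3 days, and I have finally concluded that the problem is in this query.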
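One reading that reproduces the 9.117 figure, offered as a sketch and not as the asker's confirmed method: treat the first two rows as a 2x2 contingency table, with sessionnumber as the row variable and the sessioncount and timespent columns as the two categories. The helper name `chi_square_2d` is hypothetical.

```python
def chi_square_2d(table):
    """Pearson chi-square statistic for a 2-D table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: sessionnumber 1 and 2; columns: sessioncount, timespent (sample data).
first_two_rows = [[17, 28], [22, 8]]
print(round(chi_square_2d(first_two_rows), 3))  # prints 9.117
```

Under this reading, the expected counts are 23.4, 21.6, 15.6 and 14.4, and every cell deviates from its expected count by 6.4.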

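Could someone please help me? I would really appreciate it! (I will also put a bounty on this question in 2 days.) Thanks in advance. I know only a little about SQL queries, since I am new to it and have been using it for only about 3 months, so I really need some help!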

1 Answer

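The chi-square value is defined on a 2-dimensional contingency table, not on a 3-dimensional one. You seem to be adapting the 2-dimensional formula to three dimensions, and that simply doesn't work.

It is possible to generalize the chi-square test to higher dimensions. I discuss this in a post, along with my reasons for arguing against that approach.

I suggest that you rephrase your problem as a 2-dimensional chi-square test and apply the arithmetic in your code to that problem. That is, analyze two dimensions at a time.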

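EDIT:

I don't think you understand the chi-square test. It applies when you have categorical variables along two dimensions. For example, you might have "color" and "response", with a matrix like the following: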

Color     Yes     No
Red        18    203
Blue       10    182
Green      22    134

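And you want to know the probability (likelihood) that the matrix arose at random, assuming the marginal distributions (the totals along each dimension) stay the same.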
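As a rough illustration of the arithmetic on the Color table above, the sketch below computes the expected count for each cell from the marginal totals and accumulates the chi-square statistic. The variable names are illustrative only. Turning the statistic into a probability would additionally require the chi-square distribution with the computed degrees of freedom, which is not shown here.

```python
# Observed counts from the Color x Response table: (Yes, No) per color.
observed = {"Red": (18, 203), "Blue": (10, 182), "Green": (22, 134)}

row_totals = {color: sum(counts) for color, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

chi_square = 0.0
expected_sum = 0.0
for color, counts in observed.items():
    for j, obs in enumerate(counts):
        # Expected count if color and response were independent,
        # holding the marginal totals fixed.
        expected = row_totals[color] * col_totals[j] / grand_total
        expected_sum += expected
        chi_square += (obs - expected) ** 2 / expected

# (rows - 1) * (columns - 1) for a 2-D contingency table.
degrees_of_freedom = (len(observed) - 1) * (len(col_totals) - 1)
print(f"chi-square = {chi_square:.2f}, df = {degrees_of_freedom}")
```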

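Your example has two or three (if you include "sessionnumber") numeric variables. You should investigate alternative statistical techniques. I would actually start with univariate correlation analysis (Pearson correlation) and linear regression.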
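A minimal sketch of that starting point, assuming each row of the question's sample data counts as one observation and ignoring the cnt column (both are assumptions, not something the answer specifies). The helper names `pearson_r` and `least_squares_line` are hypothetical.

```python
import math

# sessioncount and timespent columns from the question's sample data.
sessioncount = [17, 22, 1, 1, 8, 8, 11, 1, 62, 6]
timespent = [28, 8, 1, 1, 111, 65, 5, 1, 64, 42]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def least_squares_line(xs, ys):
    """Slope and intercept of the ordinary least-squares fit y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


r = pearson_r(sessioncount, timespent)
slope, intercept = least_squares_line(sessioncount, timespent)
print(f"r = {r:.3f}, timespent ~ {slope:.3f} * sessioncount + {intercept:.3f}")
```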

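EDIT II:

I am providing the correct form of the chi-square query, even though I do not advocate using a chi-square test on your data. The columns are probably correlated (instances with a high session count seem similar, even when they are not in the same bucket).

Your query has the right form; you just need to remove one of the dimensions: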

SELECT sessioncount, timespent, expected, dev,
       dev*dev/cast(expected as float) as chi_square
FROM (SELECT d3.sessionnumber, d3.sessioncount, d3.timespent,
             -- two dimensions: divide by the grand total once, not by its square
             (dim2.cnt * dim3.cnt)/cast(dimall.cnt as float) as expected,
             d3.cnt-(dim2.cnt * dim3.cnt)/dimall.cnt as dev
      FROM d3 JOIN
           -- marginal totals per sessioncount
           (SELECT sessioncount, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY sessioncount
           ) dim2
           ON d3.sessioncount = dim2.sessioncount JOIN
           -- marginal totals per timespent
           (SELECT timespent, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY timespent
           ) dim3
           ON d3.timespent = dim3.timespent CROSS JOIN
           -- grand total
           (SELECT SUM(cast(cnt as float)) as cnt
            FROM d3
           ) dimall
     ) a

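This works for the cells that appear in the table. However, to get the full chi-square value, you need to take every cell into account, even those with a count of 0: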

SELECT sessioncount, timespent, cnt, expected, dev,
       dev*dev/cast(expected as float) as chi_square
FROM (SELECT allcells.sessioncount, allcells.timespent,
             cells.cnt,
             (dim2.cnt * dim3.cnt)/cast(dimall.cnt as float) as expected,
             coalesce(cells.cnt, 0) - (dim2.cnt * dim3.cnt)/dimall.cnt as dev
      FROM (select sc.sessioncount, ts.timespent
            from (select distinct sessioncount from d3) sc cross join
                 (select distinct timespent from d3) ts
           ) allcells left join
           (select sessioncount, timespent, sum(cnt) as cnt
            from d3
            group by sessioncount, timespent
           ) cells
           on allcells.sessioncount = cells.sessioncount and
              allcells.timespent = cells.timespent left JOIN
           (SELECT sessioncount, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY sessioncount
           ) dim2
           ON allcells.sessioncount = dim2.sessioncount left JOIN
           (SELECT timespent, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY timespent
           ) dim3
           ON allcells.timespent = dim3.timespent CROSS JOIN
           (SELECT SUM(cast(cnt as float)) as cnt
            FROM d3
          ) dimall
     ) a

Here is a SQL Fiddle with this query.
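To see why the zero-count cells matter, here is a small sketch using the question's ten sample rows (an assumption: the real d3 table is evidently larger). It sums the expected counts once over only the cells that occur in the data and once over every combination of sessioncount and timespent. Only the full grid adds back up to the grand total.

```python
from collections import Counter
from itertools import product

# (sessioncount, timespent, cnt) rows from the question's sample data.
rows = [(17, 28, 45), (22, 8, 30), (1, 1, 2), (1, 1, 2), (8, 111, 119),
        (8, 65, 73), (11, 5, 16), (1, 1, 2), (62, 64, 126), (6, 42, 48)]

cells, by_count, by_time = Counter(), Counter(), Counter()
for sc, ts, cnt in rows:
    cells[(sc, ts)] += cnt   # observed count per cell (split rows are summed)
    by_count[sc] += cnt      # marginal total per sessioncount
    by_time[ts] += cnt       # marginal total per timespent
total = sum(by_count.values())


def expected(sc, ts):
    """Expected count for one cell under independence."""
    return by_count[sc] * by_time[ts] / total


# Only the cells that actually occur in the data.
observed_only = sum(expected(sc, ts) for sc, ts in cells)
# Every combination of the two dimensions, including zero-count cells.
full_grid = sum(expected(sc, ts) for sc, ts in product(by_count, by_time))

print(total, round(observed_only, 2), round(full_grid, 2))
```

The three rows for the "1, 1" cell collapse into a single cell with a count of 6, which mirrors the `group by sessioncount, timespent` subquery above.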

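Also, your original query might work for a multi-dimensional chi-square; I did not look closely at the data. Usually, when data has a cnt column, it is already in the form of a contingency table (possibly with the "0" cells missing). Your data has cells split across multiple rows (in particular "1, 1"), so the version above takes that into account.

And, since your original question was about a 3-dimensional chi-square, here is the correct query: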

SELECT sessionnumber, sessioncount, timespent, cnt, expected, dev,
       dev*dev/cast(expected as float) as chi_square
FROM (SELECT allcells.sessionnumber, allcells.sessioncount, allcells.timespent,
             cells.cnt,
             (dim1.cnt * dim2.cnt * dim3.cnt)/cast(dimall.cnt*dimall.cnt as float) as expected,
             coalesce(cells.cnt, 0) - (dim1.cnt * dim2.cnt * dim3.cnt)/(dimall.cnt*dimall.cnt) as dev
      FROM (select sn.sessionnumber, sc.sessioncount, ts.timespent
            from (select distinct sessioncount from d3) sc cross join
                 (select distinct timespent from d3) ts cross join
                 (select distinct sessionnumber from d3) sn
           ) allcells left join
           (select sessionnumber, sessioncount, timespent, sum(cnt) as cnt
            from d3
            group by sessionnumber, sessioncount, timespent
           ) cells
           on allcells.sessioncount = cells.sessioncount and
              allcells.timespent = cells.timespent and
              allcells.sessionnumber = cells.sessionnumber left JOIN
           (SELECT sessionnumber, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY sessionnumber
           ) dim1
           ON allcells.sessionnumber = dim1.sessionnumber left JOIN
            (SELECT sessioncount, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY sessioncount
           ) dim2
           ON allcells.sessioncount = dim2.sessioncount left JOIN
           (SELECT timespent, SUM(cast(cnt as float)) as cnt
            FROM d3
            GROUP BY timespent
           ) dim3
           ON allcells.timespent = dim3.timespent CROSS JOIN
           (SELECT SUM(cast(cnt as float)) as cnt
            FROM d3
          ) dimall
     ) a

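Along with its corresponding SQL Fiddle.

For both SQL Fiddle versions, I have verified that the sum of the expected values equals the sum of the original counts, which is a good check on the arithmetic.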
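The same check can be sketched outside the database for the 3-dimensional case. This is an illustration on the question's ten sample rows only, not a reproduction of the SQL Fiddle. It uses the expected-count formula from the 3-dimensional query (the product of the three marginal totals divided by the square of the grand total) over the full grid of cells.

```python
from collections import Counter
from itertools import product

# (sessionnumber, sessioncount, timespent, cnt) rows from the sample data.
rows = [(1, 17, 28, 45), (2, 22, 8, 30), (3, 1, 1, 2), (4, 1, 1, 2),
        (5, 8, 111, 119), (6, 8, 65, 73), (7, 11, 5, 16), (8, 1, 1, 2),
        (9, 62, 64, 126), (10, 6, 42, 48)]

cells, dim1, dim2, dim3 = Counter(), Counter(), Counter(), Counter()
for sn, sc, ts, cnt in rows:
    cells[(sn, sc, ts)] += cnt
    dim1[sn] += cnt   # marginal total per sessionnumber
    dim2[sc] += cnt   # marginal total per sessioncount
    dim3[ts] += cnt   # marginal total per timespent
total = sum(dim1.values())

expected_sum = 0.0
chi_square = 0.0
for sn, sc, ts in product(dim1, dim2, dim3):
    expected = dim1[sn] * dim2[sc] * dim3[ts] / total ** 2
    observed = cells.get((sn, sc, ts), 0)   # zero for cells not in the data
    expected_sum += expected
    chi_square += (observed - expected) ** 2 / expected

# The expected counts over the full grid should add up to the grand total.
print(total, round(expected_sum, 6))
```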

Answered 2013-08-05T00:44:45.510