r - Statistics for 2x4 contingency tables with large and small counts

Question

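Apologies if this is a very naive question...

I have 7000 2x4 contingency tables of count data. Each one represents a specific position in the genome and the number of times each DNA nucleotide was observed at that position in 2 different environments. An example contingency table would be: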

            A      C      G      T 
condition1  0      2      20     70000
condition2  3      15     0      95000

or
            A      C     G       T 
condition1  80146  0     5       0
condition2  26821  2     4       0

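The data can only be non-negative integers. The minimum count is 0 and the maximum can be as high as ~800,000. Typically one cell holds nearly all of the counts for its row, and it is usually the same column in both conditions (e.g. cell T in the first table above and cell A in the second). Then 1 or 2 other cells have low counts... any difference, if there is one, should show up in those other cells.

The goal is to identify the positions that differ significantly between the two environmental conditions, for further analysis. Our measurement method has an estimated error rate of 10^-6.

I am using R to analyse these data. I am not sure whether I can run a chi-squared test on them, because some cells have small or zero counts. With Fisher's test, I get 2 errors: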

with a workspace of 1E5 
FEXACT error 40.
Out of workspace.

with a workspace of >3E5
FEXACT error 501.
The hash table key cannot be computed because the largest key
is larger than the largest representable int.
The algorithm cannot proceed.
Reduce the workspace size or use another algorithm.

Can anyone suggest an appropriate test, or settings for Fisher's test or the chi-squared test?

Many thanks in advance,

Ron

score 0 · Accepted Answer

Fisher's exact test in R only works for smaller data. If you reduce the counts in the T column from 70000 and 95000 to 700 and 950, the Fisher test runs.
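If the exact computation will not fit in the workspace, one option that may be worth trying is the Monte Carlo version of the same test: fisher.test accepts simulate.p.value = TRUE for tables larger than 2x2, which samples tables with the observed margins instead of enumerating them. A minimal sketch on the first table from the question (the seed and the value of B are arbitrary choices, not something from the question):

```r
# First example table from the question (rows = conditions, columns = nucleotides)
tab <- matrix(c(0,  2, 20, 70000,
                3, 15,  0, 95000),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("condition1", "condition2"),
                              c("A", "C", "G", "T")))

set.seed(1)  # the simulated p-value changes from run to run without a fixed seed

# Monte Carlo p-value based on B random tables with the same row and column totals
res <- fisher.test(tab, simulate.p.value = TRUE, B = 1e5)
res$p.value
```

Note that a simulated p-value can never be smaller than 1/(B + 1), so B has to be large enough for whatever significance threshold is used across the 7000 tables.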

I also tried chisq.test on your data, and it ran. For large counts, the chi-squared test is preferable to Fisher's exact test.
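Since the question raises the concern about small or zero cells, it may help to look at the expected counts that chisq.test computes, and to compare against a simulated p-value that does not rely on the chi-squared approximation. This is only a sketch on the first example table; the seed and B are arbitrary:

```r
# First example table from the question
tab <- matrix(c(0,  2, 20, 70000,
                3, 15,  0, 95000),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("condition1", "condition2"),
                              c("A", "C", "G", "T")))

# Asymptotic test; suppressWarnings() only hides the
# "Chi-squared approximation may be incorrect" warning
res <- suppressWarnings(chisq.test(tab))

# Expected counts under independence; small values (commonly, below 5)
# are what trigger that warning
round(res$expected, 2)

# Monte Carlo p-value, computed by simulating tables with the observed margins
set.seed(1)
res_mc <- chisq.test(tab, simulate.p.value = TRUE, B = 1e5)
res_mc$p.value
```

Here the A column has expected counts below 5 in both rows, which is why the asymptotic p-value should be read with some caution.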

score 0 · Accepted Answer

The chi-squared test works:

df1 = structure(list(A = c(0L, 3L), C = c(2L, 15L), G = c(20L, 0L), 
    T = c(70000L, 95000L)), .Names = c("A", "C", "G", "T"), class = "data.frame", row.names = 1:2)

df1
  A  C  G     T
1 0  2 20 70000
2 3 15  0 95000

chisq.test(df1)

        Pearson's Chi-squared test

data:  df1
X-squared = 35.8943, df = 3, p-value = 7.884e-08

Warning message:
In chisq.test(df1) : Chi-squared approximation may be incorrect

I'm not sure whether this is sufficient.
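For what it is worth, here is a rough sketch of how the same test might be run over many tables at once. It assumes the tables are held in a named list of 2x4 matrices; table_pvalue and the list names are hypothetical, and the Benjamini-Hochberg adjustment via p.adjust is my own suggestion for handling 7000 tests rather than anything from the question. All-zero columns (such as T in the second example) are dropped first, because they give expected counts of 0 and a NaN statistic.

```r
# Hypothetical helper: Monte Carlo chi-squared p-value for one count table
table_pvalue <- function(tab, B = 1e4) {
  # Drop empty rows and columns; they produce NaN expected counts
  tab <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]
  # Fewer than 2 rows or columns left: nothing to compare
  if (nrow(tab) < 2 || ncol(tab) < 2) return(NA_real_)
  chisq.test(tab, simulate.p.value = TRUE, B = B)$p.value
}

conditions  <- c("condition1", "condition2")
nucleotides <- c("A", "C", "G", "T")

# The two example tables from the question, stored as a named list
tables <- list(
  pos1 = matrix(c(0, 2, 20, 70000, 3, 15, 0, 95000),
                nrow = 2, byrow = TRUE,
                dimnames = list(conditions, nucleotides)),
  pos2 = matrix(c(80146, 0, 5, 0, 26821, 2, 4, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(conditions, nucleotides))
)

set.seed(1)
pvals <- vapply(tables, table_pvalue, numeric(1))

# Adjust for testing many positions (Benjamini-Hochberg false discovery rate)
p.adjust(pvals, method = "BH")
```

With B = 1e4 the smallest possible p-value is about 1e-4, so B would probably need to be raised for the positions that look interesting.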
