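I have a MySQL "problem" that I'm having trouble wrapping my head around.
I have a table of strings from a database (they are actually genotypes, but that shouldn't matter), and each string can occur in anywhere from one to three samples. I want to count the number of unique alleles for each sample (s_id) per catalog id (c_id). For example, given the following table: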
id       batch_id  catalog_id  sample_id  tag_id   allele  depth
309      1         324         1          323      TCGC    244
1449616  1         324         2          7961     TCGC    192
2738325  1         324         2          1168472  CCGG    31
3521555  1         324         3          221716   TAAC    29
So far I have been able to put together the following code:
-- Per-catalog summary: non-consensus allele rows and summed depth for each sample
CREATE TABLE danumbers2
SELECT catalog_id,
count(case when sample_id = '1' and allele != 'consensus' then sample_id end) as SAMPLE1,
count(case when sample_id = '2' and allele != 'consensus' then sample_id end) as SAMPLE2,
count(case when sample_id = '3' and allele != 'consensus' then sample_id end) as SAMPLE3,
sum(case when sample_id = '1' and allele != 'consensus' then depth end) as DEPTH1,
sum(case when sample_id = '2' and allele != 'consensus' then depth end) as DEPTH2,
sum(case when sample_id = '3' and allele != 'consensus' then depth end) as DEPTH3,
-- distinct allele values in the catalog across all samples (not filtered on 'consensus')
count(distinct allele) AS ALLELECOUNT
from matches
group by catalog_id;
-- Keep only the catalogs that pass the count and depth thresholds
CREATE TABLE thehitlist_all
SELECT catalog_id,SAMPLE1,SAMPLE2,SAMPLE3,DEPTH1,DEPTH2,DEPTH3,ALLELECOUNT
FROM danumbers2
WHERE (SAMPLE1>1 AND SAMPLE2>1 AND SAMPLE3>1 AND ALLELECOUNT>1 AND DEPTH2>10 AND DEPTH3>10);
This gives the following result:
catalog_id  SAMPLE1  SAMPLE2  SAMPLE3  DEPTH1  DEPTH2  DEPTH3  ALLELECOUNT
324         1        2        1        244     223     29      4
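The result is essentially a count, ordered by catalog_id, of the total number of alleles in each sample, together with a count of the total number of distinct alleles per catalog id. What I am interested in calculating (but can't seem to figure out!) is the "unique" alleles that are not shared between samples. In other words, I want to find the diagnostic "alleles" for each sample at each catalog id.
So for the example data above, I would like the table to look like this: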
catalog_id  SAMPLE1  SAMPLE2  SAMPLE3  ALLELECOUNT
324         0        1        1        2
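To make that target concrete, here is a minimal sketch of one way the calculation could be expressed. It is an illustration under assumptions, not a tested MySQL solution: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it is self-contained, and the function name private_allele_counts and the in-memory copy of the matches table are hypothetical. The idea is to first keep only the (catalog_id, allele) pairs seen in exactly one sample, then pivot those per sample.

```python
import sqlite3

# Inner query: alleles seen in exactly one sample within a catalog.
# Outer query: pivot those "private" alleles into one column per sample.
QUERY = """
SELECT u.catalog_id,
       SUM(CASE WHEN u.sample_id = 1 THEN 1 ELSE 0 END) AS SAMPLE1,
       SUM(CASE WHEN u.sample_id = 2 THEN 1 ELSE 0 END) AS SAMPLE2,
       SUM(CASE WHEN u.sample_id = 3 THEN 1 ELSE 0 END) AS SAMPLE3,
       COUNT(*) AS ALLELECOUNT
FROM (
    SELECT catalog_id, allele, MIN(sample_id) AS sample_id
    FROM matches
    WHERE allele != 'consensus'
    GROUP BY catalog_id, allele
    HAVING COUNT(DISTINCT sample_id) = 1
) AS u
GROUP BY u.catalog_id
"""

# The four example rows from the question.
ROWS = [
    (309, 1, 324, 1, 323, "TCGC", 244),
    (1449616, 1, 324, 2, 7961, "TCGC", 192),
    (2738325, 1, 324, 2, 1168472, "CCGG", 31),
    (3521555, 1, 324, 3, 221716, "TAAC", 29),
]


def private_allele_counts(rows):
    """Load rows into a throwaway in-memory table and run QUERY (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE matches (id INTEGER, batch_id INTEGER, catalog_id INTEGER, "
        "sample_id INTEGER, tag_id INTEGER, allele TEXT, depth INTEGER)"
    )
    conn.executemany("INSERT INTO matches VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


print(private_allele_counts(ROWS))  # [(324, 0, 1, 1, 2)]
```

The query text uses only GROUP BY, HAVING, and CASE, so it should carry over to MySQL largely unchanged, though that has not been verified here. Note that a catalog with no private alleles at all would be missing from the output rather than shown with zeros.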
Any ideas would be greatly appreciated! Please let me know if I can provide more information.