mysql - Finding unique string matches within subgroups of a group

Question

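I have a mysql "problem" that I am having trouble solving.

I have a table of strings from a database (they are actually genotypes, but that shouldn't matter), each of which can be present in anywhere from one to three samples. I want to count the number of unique alleles for each sample (s_id) within each catalog id (c_id). For example, given the following table: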

id       batch_id  catalog_id  sample_id  tag_id   allele  depth
309      1         324         1          323      TCGC    244
1449616  1         324         2          7961     TCGC    192
2738325  1         324         2          1168472  CCGG    31
3521555  1         324         3          221716   TAAC    29

So far I have been able to put together the following code:

CREATE TABLE danumbers2
SELECT catalog_id,
count(case when sample_id = '1' and allele != 'consensus' then sample_id end) as SAMPLE1,
count(case when sample_id = '2' and allele != 'consensus' then sample_id end) as SAMPLE2,
count(case when sample_id = '3' and allele != 'consensus' then sample_id end) as SAMPLE3,
sum(case when sample_id = '1' and allele != 'consensus' then depth end) as DEPTH1,
sum(case when sample_id = '2' and allele != 'consensus' then depth end) as DEPTH2,
sum(case when sample_id = '3' and allele != 'consensus' then depth end) as DEPTH3,
count(distinct allele) AS ALLELECOUNT

from matches as danumbers
group by catalog_id;

CREATE TABLE thehitlist_all
SELECT catalog_id,SAMPLE1,SAMPLE2,SAMPLE3,DEPTH1,DEPTH2,DEPTH3,ALLELECOUNT
FROM danumbers
WHERE (SAMPLE1>1 AND SAMPLE2>1 AND SAMPLE3>1 AND ALLELECOUNT>1 AND DEPTH2>10 AND DEPTH3>10);

This gives the following result:

catalog_id  SAMPLE1  SAMPLE2  SAMPLE3  DEPTH1  DEPTH2  DEPTH3  ALLELECOUNT
324         1        2        1        244     223     29      4

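The result is essentially a count, per catalog_id, of the total number of alleles in each sample, plus a count of the total number of distinct alleles per catalog id. What I am interested in calculating (but can't seem to figure out!) is the "unique" alleles that are not shared between samples. In other words, I want to find the "diagnostic" alleles for each sample at each catalog ID.

So, for the example data above, I would like the table to look like this: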

catalog_id  SAMPLE1  SAMPLE2  SAMPLE3  ALLELECOUNT
324         0        1        1        2

Any ideas would be greatly appreciated! Please let me know if I can provide more information.

score 2 · Accepted Answer

You can simply add the other column name to the COUNT(DISTINCT ...):

COUNT(DISTINCT s_id, allele) AS ALLELECOUNT

This will count the unique combinations of s_id (the sample_id column in the table above) and allele.
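To see what counting distinct (sample, allele) pairs yields on the question's four example rows, here is a minimal sketch that loads them into an in-memory SQLite database from Python. SQLite stands in for MySQL here, which is an assumption of this sketch. Because SQLite's COUNT(DISTINCT ...) takes a single argument, the pair count is written as a DISTINCT subquery, which should be equivalent as long as neither column contains NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (id INTEGER, batch_id INTEGER, catalog_id INTEGER, "
    "sample_id INTEGER, tag_id INTEGER, allele TEXT, depth INTEGER)"
)
# The four example rows from the question.
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (309, 1, 324, 1, 323, "TCGC", 244),
        (1449616, 1, 324, 2, 7961, "TCGC", 192),
        (2738325, 1, 324, 2, 1168472, "CCGG", 31),
        (3521555, 1, 324, 3, 221716, "TAAC", 29),
    ],
)

# Stand-in for MySQL's COUNT(DISTINCT sample_id, allele), grouped by catalog_id.
pair_counts = conn.execute(
    """
    SELECT catalog_id, COUNT(*) AS ALLELECOUNT
    FROM (SELECT DISTINCT catalog_id, sample_id, allele FROM matches)
    GROUP BY catalog_id
    """
).fetchall()
print(pair_counts)  # [(324, 4)]
```

The four pairs are (1, TCGC), (2, TCGC), (2, CCGG) and (3, TAAC), so an allele shared by two samples is counted once per sample.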

score 0 · Accepted Answer

This will give you the complete records of the matches whose allele is diagnostic within its catalog_id:

select good.*
from matches good
  left join matches dq on
    dq.catalog_id = good.catalog_id and
    dq.allele = good.allele and
    dq.sample_id != good.sample_id
where dq.catalog_id is null

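From here, you should be able to dump the rows into a temp table and summarize them easily using a technique similar to the one you already showed. If you prefer, you can skip the temp table and go straight to the summary.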
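One way the "straight to the summary" route might look is to combine the anti-join with the conditional counts from the question. This is a sketch rather than the answerer's exact query: it runs against an in-memory SQLite database from Python (an assumption; the SQL itself uses only constructs MySQL also accepts), and it uses COUNT(DISTINCT ...) per sample so that a repeated allele within one sample is counted once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (id INTEGER, batch_id INTEGER, catalog_id INTEGER, "
    "sample_id INTEGER, tag_id INTEGER, allele TEXT, depth INTEGER)"
)
# The four example rows from the question.
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (309, 1, 324, 1, 323, "TCGC", 244),
        (1449616, 1, 324, 2, 7961, "TCGC", 192),
        (2738325, 1, 324, 2, 1168472, "CCGG", 31),
        (3521555, 1, 324, 3, 221716, "TAAC", 29),
    ],
)

# Keep only rows whose allele appears in no other sample of the same catalog,
# then count the surviving alleles per sample and overall.
summary = conn.execute(
    """
    SELECT good.catalog_id,
           COUNT(DISTINCT CASE WHEN good.sample_id = 1 THEN good.allele END) AS SAMPLE1,
           COUNT(DISTINCT CASE WHEN good.sample_id = 2 THEN good.allele END) AS SAMPLE2,
           COUNT(DISTINCT CASE WHEN good.sample_id = 3 THEN good.allele END) AS SAMPLE3,
           COUNT(DISTINCT good.allele) AS ALLELECOUNT
    FROM matches AS good
    LEFT JOIN matches AS dq
      ON dq.catalog_id = good.catalog_id
     AND dq.allele = good.allele
     AND dq.sample_id != good.sample_id
    WHERE dq.catalog_id IS NULL
    GROUP BY good.catalog_id
    """
).fetchall()
print(summary)  # [(324, 0, 1, 1, 2)]
```

On the example data this reproduces the row the question asks for. Note that a catalog_id whose alleles are all shared between samples has no surviving rows and so does not appear in the result at all.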

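This query only filters out rows whose allele is found in more than one sample within a catalog. If the same allele is found more than once for the same sample in the same catalog, it will still return a row for each of those records. If you want to select alleles for which only one record exists per catalog (rather than one sample per catalog), you can change dq.sample_id != good.sample_id to dq.id != good.id.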
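The difference between the two join conditions only shows up when a sample repeats an allele, which the question's example data never does. The sketch below therefore adds one hypothetical row (id 9999999, with a made-up tag_id and depth) that repeats TAAC in sample 3, and runs both variants in an in-memory SQLite database from Python. The helper name diagnostic_ids is an illustration, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE matches (id INTEGER, batch_id INTEGER, catalog_id INTEGER, "
    "sample_id INTEGER, tag_id INTEGER, allele TEXT, depth INTEGER)"
)
conn.executemany(
    "INSERT INTO matches VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (309, 1, 324, 1, 323, "TCGC", 244),
        (1449616, 1, 324, 2, 7961, "TCGC", 192),
        (2738325, 1, 324, 2, 1168472, "CCGG", 31),
        (3521555, 1, 324, 3, 221716, "TAAC", 29),
        # Hypothetical extra row: TAAC seen a second time in sample 3.
        (9999999, 1, 324, 3, 999, "TAAC", 12),
    ],
)

def diagnostic_ids(condition):
    """Run the anti-join with the given disqualifying condition; return kept ids."""
    sql = f"""
        select good.id
        from matches good
          left join matches dq on
            dq.catalog_id = good.catalog_id and
            dq.allele = good.allele and
            {condition}
        where dq.catalog_id is null
        order by good.id
    """
    return [row[0] for row in conn.execute(sql)]

by_sample = diagnostic_ids("dq.sample_id != good.sample_id")
by_record = diagnostic_ids("dq.id != good.id")
print(by_sample)  # [2738325, 3521555, 9999999]
print(by_record)  # [2738325]
```

With the sample_id condition both TAAC records survive because no other sample carries TAAC. With the id condition they disqualify each other, leaving only the single CCGG record.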
