
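As part of a longer and more complex query, I am trying to keep only one entry out of each set of overlapping intervals, plus all entries that do not overlap anything. Here is a minimal example: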

create table protein (
    seqid varchar(100),
    start SMALLINT(5),
    `end` SMALLINT(5),
    cutoff FLOAT(5,4),
    seq_region TEXT
);

insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 280, 290, 0.75, "RIQHQQKVKEL");
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 314, 556, 0.75, "EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT");
insert into protein (seqid, start, `end`, cutoff, seq_region) values ("A0MZ66", 356, 406, 1.0, "PPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLK");

SELECT *  from protein;
A0MZ66|280|290|0.75|CCCCCC
A0MZ66|314|556|0.75|ABCDEFG
A0MZ66|356|406|1.0|ABCD

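Entries 2 and 3 have the same id and overlapping ranges (the start and end of one are contained in the other), but a different cutoff and seq_region. Entry #3 is in fact a substring of entry #2. What I cannot express in SQL is this condition: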

  • If two ranges from the same seqid overlap, pick the one with score (cutoff) == 0.75 (or the longest seq_region, since these attributes are tied together)

The desired output is entries #1 and #2:

A0MZ66|280|290|0.75|RIQHQQKVKEL
A0MZ66|314|556|0.75|EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT

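How can I write this as a SQL query? For the overlap condition, you can assume that one interval is always fully contained in the other (the start or the end may be identical). In case it matters, this is a SQLite3 database.

I think I need some kind of inner self-join or a GROUP BY for this, but I can't quite get it right. Any input is much appreciated.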


2 Answers


You can use NOT EXISTS:

select p.* from protein p
where not exists (
  select 1 from protein
  where seqid = p.seqid and cutoff <> p.cutoff and seq_region <> p.seq_region
  and seq_region like '%' || p.seq_region || '%'
)

See the demo.
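For anyone who wants to try the substring variant locally, here is a minimal sketch using Python's built-in sqlite3 module. The table mirrors the one in the question, but the sequences are the shortened placeholders from the question's sample output (CCCCCC, ABCDEFG, ABCD), not the real protein strings, and the q alias and the ORDER BY are additions of mine for readability and a stable row order.

```python
import sqlite3

# Stand-in rows: the sequences are the shortened placeholders from the
# question's sample output, not the real protein sequences.
ROWS = [
    ("A0MZ66", 280, 290, 0.75, "CCCCCC"),
    ("A0MZ66", 314, 556, 0.75, "ABCDEFG"),
    ("A0MZ66", 356, 406, 1.0, "ABCD"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE protein (seqid TEXT, start INTEGER, "end" INTEGER, '
    "cutoff REAL, seq_region TEXT)"
)
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?, ?)", ROWS)

# Keep a row unless another row of the same seqid, with a different cutoff,
# has a seq_region that contains this row's seq_region as a substring.
SUBSTRING_QUERY = """
SELECT p.seqid, p.start, p."end", p.cutoff, p.seq_region
FROM protein AS p
WHERE NOT EXISTS (
    SELECT 1 FROM protein AS q
    WHERE q.seqid = p.seqid
      AND q.cutoff <> p.cutoff
      AND q.seq_region <> p.seq_region
      AND q.seq_region LIKE '%' || p.seq_region || '%'
)
ORDER BY p.start
"""
kept_by_substring = conn.execute(SUBSTRING_QUERY).fetchall()
for row in kept_by_substring:
    print(row)
```

Two things to keep in mind with this variant: LIKE in SQLite is case-insensitive for ASCII letters by default, and any % or _ in seq_region would act as a wildcard. Neither should matter for plain uppercase amino-acid strings.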

Or, if you want to detect the overlapping intervals through the start and end columns:

select p.* from protein p
where not exists (
  select 1 from protein
  where seqid = p.seqid and cutoff <> p.cutoff and seq_region <> p.seq_region
  -- `end` is quoted as in the question's CREATE TABLE, since END is an SQL keyword
  and start <= p.start and `end` >= p.`end` and (`end` - start) > (p.`end` - p.start)
)

See the demo.

Result:

| seqid  | start | end | cutoff | seq_region                                                                                                                                                                                                                                          |
| ------ | ----- | --- | ------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| A0MZ66 | 280   | 290 | 0.75   | RIQHQQKVKEL                                                                                                                                                                                                                                         |
| A0MZ66 | 314   | 556 | 0.75   | EEDKKELELKYQNSEEKARNLKHSVDELQKRVNQSENSVPPPPPPPPPLPPPPPNPIRSLMSMIRKRSHPSGSGAKKEKATQPETTEEVTDLKRQAVEEMMDRIKKGVHLRPVNQTARPKTKPESSKGCESAVDELKGILGTLNKSTSSRSLKSLDPENSETELERILRRRKVTAEADSSSPTGILATSESKSMPVLGSVSSVTKTALNKKTLEAEFNSPSPPTPEPGEGPRKLEGCTSSKVT |
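The question mentions that the start or the end of two nested intervals may be identical. Here is a rough sketch, again with Python's sqlite3 module, that checks the start/end variant against that case. The fourth row (314 to 400, sequence ABC) is hypothetical and only there to share a start with the long range; the sequences are shortened stand-ins, and the q alias and ORDER BY are my additions.

```python
import sqlite3

# Rows 1-3 follow the question (with shortened stand-in sequences);
# row 4 is a hypothetical extra that shares its start with the long range.
ROWS = [
    ("A0MZ66", 280, 290, 0.75, "CCCCCC"),
    ("A0MZ66", 314, 556, 0.75, "ABCDEFG"),
    ("A0MZ66", 356, 406, 1.0, "ABCD"),
    ("A0MZ66", 314, 400, 1.0, "ABC"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE protein (seqid TEXT, start INTEGER, "end" INTEGER, '
    "cutoff REAL, seq_region TEXT)"
)
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?, ?)", ROWS)

# Drop a row when a strictly longer range of the same seqid, with a
# different cutoff, fully contains it (shared endpoints are allowed).
CONTAINMENT_QUERY = """
SELECT p.seqid, p.start, p."end", p.cutoff, p.seq_region
FROM protein AS p
WHERE NOT EXISTS (
    SELECT 1 FROM protein AS q
    WHERE q.seqid = p.seqid
      AND q.cutoff <> p.cutoff
      AND q.seq_region <> p.seq_region
      AND q.start <= p.start
      AND q."end" >= p."end"
      AND (q."end" - q.start) > (p."end" - p.start)
)
ORDER BY p.start
"""
kept_by_range = conn.execute(CONTAINMENT_QUERY).fetchall()
for row in kept_by_range:
    print(row)
```

Note that because of the cutoff <> p.cutoff condition, two nested ranges with the same cutoff would both be kept. Whether that is what you want depends on your data.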
answered 2020-07-31T13:50:49.387

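This is a "gaps and islands" problem. First you need to identify which rows belong to the same group, and then pick one row from each group according to your criteria. For example, you could write the query as follows: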

with
y as (
  select *,
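    -- running total of the "new island" flags gives each island its group number
    sum(st) over(partition by seqid order by start, `end`) as grp
  from (
    select *,
      -- st = 1 when this row starts after every earlier range has ended
      case when start >
             max(`end`)
               over(partition by seqid
               order by start, `end`
               rows between unbounded preceding and 1 preceding)
           then 1 else 0 end as st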
    from protein
  ) x
),
z as (
  select *,
    row_number() over(partition by seqid, grp 
      order by case when cutoff = 0.75 then 1 else 2 end,
               length(seq_region) desc) as rn
  from y
)
select * from z where rn = 1

Result:

seqid   start  end  cutoff  seq_region    st  grp  rn 
------- ------ ---- ------- ------------- --- ---- -- 
A0MZ66  280    290  0.75    RIQHQQKVKEL   0   0    1  
A0MZ66  314    556  0.75    EEDKKELELK... 1   1    1  

See the running example on DB Fiddle.
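If you want to watch the intermediate st, grp and rn values for every row (not only the winners), here is a small sketch with Python's sqlite3 module. It assumes the bundled SQLite is 3.25 or newer, since that is when window functions were added. The CTE names flagged, grouped and ranked are my own relabeling of the same steps, and the sequences are shortened stand-ins.

```python
import sqlite3

# Window functions need SQLite 3.25 or newer (an assumption about the runtime).
if sqlite3.sqlite_version_info < (3, 25, 0):
    raise RuntimeError("SQLite 3.25+ is required for window functions")

# Shortened stand-in sequences, not the real protein strings.
ROWS = [
    ("A0MZ66", 280, 290, 0.75, "CCCCCC"),
    ("A0MZ66", 314, 556, 0.75, "ABCDEFG"),
    ("A0MZ66", 356, 406, 1.0, "ABCD"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE protein (seqid TEXT, start INTEGER, "end" INTEGER, '
    "cutoff REAL, seq_region TEXT)"
)
conn.executemany("INSERT INTO protein VALUES (?, ?, ?, ?, ?)", ROWS)

ISLANDS_QUERY = """
WITH flagged AS (
    -- st = 1 when the row starts after all earlier ranges have ended
    SELECT *,
           CASE WHEN start > MAX("end") OVER (
                    PARTITION BY seqid ORDER BY start, "end"
                    ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
                THEN 1 ELSE 0 END AS st
    FROM protein
),
grouped AS (
    -- running sum of the flags numbers the islands
    SELECT *, SUM(st) OVER (PARTITION BY seqid ORDER BY start, "end") AS grp
    FROM flagged
),
ranked AS (
    -- inside each island, prefer cutoff 0.75, then the longest sequence
    SELECT *, ROW_NUMBER() OVER (
                  PARTITION BY seqid, grp
                  ORDER BY CASE WHEN cutoff = 0.75 THEN 1 ELSE 2 END,
                           LENGTH(seq_region) DESC) AS rn
    FROM grouped
)
SELECT start, "end", st, grp, rn FROM ranked ORDER BY start
"""
trace = conn.execute(ISLANDS_QUERY).fetchall()
winners = [(start, end) for (start, end, st, grp, rn) in trace if rn == 1]
for row in trace:
    print(row)
```

The first row gets st = 0 because the window frame before it is empty, so MAX returns NULL and the comparison falls through to ELSE. That is why the first island is numbered 0 rather than 1.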

answered 2020-07-31T14:04:41.927