
I am accessing a Chado-schema MySQL database. I search by gene product; for this example, the product is "bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase".

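I can then use JOIN statements to find the assembly the gene is located on, along with its coordinates. The SQL statement below works: it returns the sequence of the whole assembly (not just the sequence of the gene), plus the start and stop positions of the gene of interest on that assembly.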

SELECT f.uniquename AS protein_accession, product.value AS protein_name, srcfeature.residues AS residue_sequence, srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

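The assembly sequence is very long, and I definitely do not need all of it. Is it better to use MySQL's SUBSTRING function to extract only the part I need, so that the whole sequence never has to be retrieved, or to use the programming language's substring method after retrieval? The query below is my attempt to call SUBSTRING with the position and length values obtained in the same query. It does not work, and my guess is that it would need multiple SELECT statements. The SQL is getting very ugly, and I am not even sure the end result would be any better.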

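What are your thoughts: is it better to do this with SQL SUBSTRING, or to use the programming language's substring method to display only what I want, even though I have already retrieved the whole sequence?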

SELECT f.uniquename AS protein_accession, product.value AS protein_name, SUBSTRING(srcfeature.residues AS residue_sequence, location_min, location_max - location_min), srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

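EDIT: Here is an example result for a different gene (one with a shorter name). I have left the sequence column out of the output because it is thousands of characters long. I need to use the location_min and location_max values shown here correctly in SUBSTRING.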

+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
| protein_accession | protein_name                                      | source_type | location_min | location_max | strand |
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
| ECDH10B_0026      | bifunctional riboflavin kinase and FAD synthetase | assembly    |        21406 |        22348 |      1 |
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
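For the client-side option, a minimal sketch in Python follows. Python is an assumption here, since the question does not name the application language, and `extract_feature` is a hypothetical helper rather than part of Chado or any library. It also assumes Chado's interbase convention for featureloc (fmin is a 0-based start, fmax an exclusive end), which maps directly onto a Python slice, and that minus-strand features should be reverse-complemented.

```python
# Hypothetical helper for slicing a feature out of its source sequence.
# Assumes Chado interbase coordinates: fmin is 0-based, fmax is exclusive.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_feature(residues: str, fmin: int, fmax: int, strand: int = 1) -> str:
    """Return the subsequence described by one featureloc row."""
    if not 0 <= fmin <= fmax <= len(residues):
        raise ValueError("fmin/fmax fall outside the source sequence")
    segment = residues[fmin:fmax]
    if strand == -1:
        # Minus-strand features are read as the reverse complement.
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment


# Toy data for illustration only, not a real assembly.
assembly = "AAACCCGGGTTT"
print(extract_feature(assembly, 3, 9))      # CCCGGG
print(extract_feature(assembly, 0, 6, -1))  # GGGTTT
```

With the coordinates from the sample row above (21406 and 22348), the slice would be 942 characters long.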

2 Answers

1

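The AS is in the wrong place. It needs to go after the closing parenthesis of SUBSTRING(). Note also that MySQL does not allow one expression in the SELECT list to refer to an alias defined in that same list, so the query below uses location.fmin and location.fmax directly. Chado stores featureloc coordinates as interbase (0-based) values, while SUBSTRING counts from 1, so the start position is fmin + 1 and the length is fmax - fmin.

SELECT f.uniquename AS protein_accession, product.value AS protein_name,
       -- fmin is 0-based (interbase); SUBSTRING positions start at 1
       SUBSTRING(srcfeature.residues, location.fmin + 1, location.fmax - location.fmin) AS residue_sequence,
       srcassembly.name AS source_type, location.fmin AS location_min, location.fmax AS location_max, location.strand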
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';

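As for your other question, I think it makes more sense to extract only the data you need in the query than to send unnecessary data back to the application. That saves communication overhead. In addition, if the database uses multiple threads or processors, it has a chance to do the work in parallel.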
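To illustrate slicing on the server side, here is a small sketch that runs the same kind of SUBSTR call against an in-memory SQLite database. SQLite is only a stand-in so the example is self-contained; it is an assumption, not the asker's MySQL setup. The two tables are cut-down, hypothetical versions of Chado's feature and featureloc tables holding toy data. Both SQLite's SUBSTR and MySQL's SUBSTRING count positions from 1, hence the `fmin + 1`.

```python
import sqlite3

# In-memory SQLite stands in for the real MySQL connection (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature (feature_id INTEGER PRIMARY KEY, residues TEXT)")
conn.execute(
    "CREATE TABLE featureloc "
    "(feature_id INTEGER, srcfeature_id INTEGER, fmin INTEGER, fmax INTEGER)"
)
conn.execute("INSERT INTO feature VALUES (1, 'AAACCCGGGTTT')")  # toy assembly
conn.execute("INSERT INTO feature VALUES (2, NULL)")            # toy polypeptide
conn.execute("INSERT INTO featureloc VALUES (2, 1, 3, 9)")      # interbase coords

# Only the requested slice leaves the database, not the whole assembly.
row = conn.execute(
    """
    SELECT SUBSTR(src.residues, loc.fmin + 1, loc.fmax - loc.fmin) AS residue_sequence,
           loc.fmin, loc.fmax
    FROM featureloc loc
    JOIN feature src ON loc.srcfeature_id = src.feature_id
    WHERE loc.feature_id = ?
    """,
    (2,),
).fetchone()

residue_sequence, fmin, fmax = row
print(residue_sequence)  # CCCGGG
```

The result matches what a client-side slice `residues[fmin:fmax]` would give, but only six characters cross the connection instead of the full sequence.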

answered 2013-04-23T20:57:50.657
0

If something like this works for you:

SELECT f.uniquename AS protein_accession, 
       product.value AS protein_name, 
       SUBSTRING(
                   srcfeature.residues, 
                   patindex('%SOMPATTERN%',srcfeature.residues), 
                   LEN(srcfeature.residues) - patindex('%SOMPATTERN%',srcfeature.residues)
                ) AS residue_sequence, 
      srcassembly.name AS source_type, 

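then do it in SQL. If not, use the application programming language. (Note that patindex and LEN are SQL Server functions; the closest MySQL counterparts are LOCATE and CHAR_LENGTH.)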

answered 2013-04-23T20:56:08.793