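I am accessing a MySQL database that uses the Chado schema. I search by gene product; for this example, the product is "bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase".
I can then use JOINs to find the assembly the gene lies on and its coordinates there. The SQL statement below works: it returns the sequence of the whole assembly (not just the gene's sequence), along with the start and stop positions of the gene of interest on that assembly.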
SELECT f.uniquename AS protein_accession,
       product.value AS protein_name,
       srcfeature.residues AS residue_sequence,
       srcassembly.name AS source_type,
       location.fmin AS location_min,
       location.fmax AS location_max,
       location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';
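The assembly sequence is very long, and I definitely don't need all of it. Is it better to extract the part I need with MySQL's SUBSTRING function, so that the whole sequence never has to be retrieved, or to use a programming language's substring method after retrieving it? The query below is my attempt to call SUBSTRING with the position and length values obtained in the same query. It doesn't work, and my guess is that it would take multiple SELECT statements to make it work. The SQL is getting very ugly, and I'm not even sure the final working result would be any better.
What are your thoughts: is it better to do this with SQL SUBSTRING, or to just use a programming language's substring method to display the part I want, even though I have already retrieved the whole thing?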
SELECT f.uniquename AS protein_accession,
       product.value AS protein_name,
       SUBSTRING(srcfeature.residues AS residue_sequence, location_min, location_max - location_min),
       srcassembly.name AS source_type,
       location.fmin AS location_min,
       location.fmax AS location_max,
       location.strand
FROM feature f
JOIN cvterm polypeptide ON f.type_id=polypeptide.cvterm_id
JOIN featureprop product ON f.feature_id=product.feature_id
JOIN cvterm productprop ON product.type_id=productprop.cvterm_id
JOIN featureloc location ON f.feature_id=location.feature_id
JOIN feature srcfeature ON location.srcfeature_id=srcfeature.feature_id
JOIN cvterm srcassembly ON srcfeature.type_id=srcassembly.cvterm_id
WHERE polypeptide.name = 'polypeptide'
AND productprop.name = 'gene_product_name'
AND product.value LIKE '%bifunctional GDP-fucose synthetase: GDP-4-dehydro-6-deoxy-D-mannose epimerase and GDP-4-dehydro-6-L-deoxygalactose reductase%';
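Edit: Here is an example result for a different gene (one with a shorter name). I left out the residue_sequence column because it is thousands of characters long. I need to use the location_min and location_max values shown here correctly in the SUBSTRING call.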
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
| protein_accession | protein_name                                      | source_type | location_min | location_max | strand |
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
| ECDH10B_0026      | bifunctional riboflavin kinase and FAD synthetase | assembly    |        21406 |        22348 |      1 |
+-------------------+---------------------------------------------------+-------------+--------------+--------------+--------+
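
For comparison, the client-side option comes down to a single slice once the row has been fetched. The sketch below is in Python and assumes the database follows the standard Chado convention that featureloc.fmin and featureloc.fmax are interbase (0-based, half-open) coordinates. The helper name extract_feature_sequence and the toy sequence are made up for illustration and are not part of Chado or any library. Under the same assumption, a server-side version would need fmin + 1 as the start position, because MySQL's SUBSTRING counts from 1.

```python
# Hypothetical helper for illustration only; not part of Chado or any library.

# Complement table for plain DNA letters. Other characters (for example N)
# are left unchanged by str.translate.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_feature_sequence(residues: str, fmin: int, fmax: int, strand: int = 1) -> str:
    """Return the part of `residues` covered by one featureloc row.

    Assumes fmin/fmax are Chado interbase coordinates (0-based, half-open),
    so they map directly onto a Python slice.
    """
    if not 0 <= fmin <= fmax <= len(residues):
        raise ValueError("fmin/fmax fall outside the source sequence")

    segment = residues[fmin:fmax]

    # A feature on the minus strand is read as the reverse complement.
    if strand == -1:
        segment = segment.translate(_COMPLEMENT)[::-1]
    return segment


# Server-side equivalent under the same assumption (MySQL SUBSTRING is 1-based):
#   SUBSTRING(srcfeature.residues, location.fmin + 1, location.fmax - location.fmin)

if __name__ == "__main__":
    toy = "AACCGGTTAC"  # made-up sequence, not real data
    print(extract_feature_sequence(toy, 2, 6))      # CCGG
    print(extract_feature_sequence(toy, 2, 7, -1))  # ACCGG
```

For the sample row above (location_min 21406, location_max 22348, strand 1), either approach would yield 22348 - 21406 = 942 characters. The difference is that the SQL version sends only those 942 characters over the connection, while the client-side version transfers the whole assembly first.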