java - Aligning a sequence within a large protein sequence

Question

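I have a large protein sequence, about 5000 characters long, so I put it in a text file (p_sqn.txt). I also have a shorter sequence,

e.g. SDJGKLDJGSNMMUWEURYI

I have to use a percent identity scoring function, so I need to find the part of the protein sequence (the text file) that is most similar to this shorter sequence.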

score 1 · Accepted Answer

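With a length of only 5000, scanning through the whole sequence won't take long (milliseconds).

Fortunately, the Apache commons-lang library provides the StringUtils.getLevenshteinDistance() utility method. With it, the code is only a few lines: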

import org.apache.commons.lang.StringUtils;

String protein; // the full sequence
String part; // your search string
int bestScore = Integer.MAX_VALUE;
int bestLocation = 0;
String bestSequence = "";
// slide a window of part.length() characters over the protein, including the final window
for (int i = 0; i <= protein.length() - part.length(); i++) {
    // substring takes (beginIndex, endIndex), so the end must be i + part.length()
    String sequence = protein.substring(i, i + part.length());
    int score = StringUtils.getLevenshteinDistance(sequence, part);
    if (score < bestScore) {
        bestScore = score;
        bestLocation = i;
        bestSequence = sequence;
    }
}

// at this point in the code, the "best" variables will have data about the best match.

FYI, a score of zero means an exact match was found.
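The question asks for percent identity, while the loop above minimizes edit distance. As a rough sketch of the same sliding-window idea scored by percent identity, assuming "percent identity" means the share of positions with identical characters in an ungapped, equal-length comparison, something like the following could work. The class and method names (PercentIdentitySearch, percentIdentity, bestWindow) and the sample protein string are made up for illustration; only the search string comes from the question.

```java
class PercentIdentitySearch {

    /** Percentage of positions at which two equal-length strings hold the same character. */
    static double percentIdentity(String a, String b) {
        if (a.length() != b.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        if (a.isEmpty()) {
            return 0.0;
        }
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return 100.0 * matches / a.length();
    }

    /** Start index of the first window with the highest identity, or -1 if part is longer than protein. */
    static int bestWindow(String protein, String part) {
        int bestLocation = -1;
        double bestIdentity = -1.0;
        for (int i = 0; i + part.length() <= protein.length(); i++) {
            String window = protein.substring(i, i + part.length());
            double identity = percentIdentity(window, part);
            if (identity > bestIdentity) {
                bestIdentity = identity;
                bestLocation = i;
            }
        }
        return bestLocation;
    }

    public static void main(String[] args) {
        String protein = "AAAASDJGKLDJGSNMMUWEURYIAAAA"; // made-up stand-in for the file contents
        String part = "SDJGKLDJGSNMMUWEURYI";
        int at = bestWindow(protein, part);
        String window = protein.substring(at, at + part.length());
        System.out.println(at + " " + percentIdentity(window, part)); // prints: 4 100.0
    }
}
```

Here a higher score is better, and 100 corresponds to the exact match that the Levenshtein version reports as zero. Unlike edit distance, this ungapped comparison does not account for insertions or deletions.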

To make reading the file easier, you can use the Apache commons-io library utility method FileUtils.readFileToString(), like this:

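import java.io.File;
import org.apache.commons.io.FileUtils;

// pass the encoding explicitly (UTF-8 is assumed here) instead of relying on the platform default
String protein = FileUtils.readFileToString(new File("/some/path/to/myproteinfile.txt"), "UTF-8");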
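If adding a library just to read one file is not wanted, a JDK-only alternative is sketched below. It also strips whitespace, on the assumption that the text file may contain line breaks or a trailing newline that would otherwise end up inside the comparison windows. The class and method names (SequenceFileReader, readSequence) and the UTF-8 encoding are assumptions for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class SequenceFileReader {

    /** Reads the whole file as UTF-8 and removes spaces, tabs and line breaks. */
    static String readSequence(Path file) throws IOException {
        String raw = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return raw.replaceAll("\\s+", "");
    }

    public static void main(String[] args) throws IOException {
        // falls back to the file name from the question when no argument is given
        String fileName = args.length > 0 ? args[0] : "p_sqn.txt";
        String protein = readSequence(Paths.get(fileName));
        System.out.println("Read " + protein.length() + " characters");
    }
}
```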
