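The 'subject' argument to matchPattern must be a special object (e.g. an XString). You can convert your sequence to an XString by collapsing it with paste and passing the result to BString (see ?BString).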
So, with your data:
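library(Biostrings)  # provides BString and matchPattern
library(seqinr)      # assumption: read.fasta comes from seqinr
file <- read.fasta(file = "mydata.txt")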
# find 'atg' locations
atg <- lapply(file, function(x) {
string <- BString(paste(x, collapse = ""))
matchPattern("atg", string)
})
atg[1:2]
# $a
# Views on a 18-letter BString subject
# subject: atgacccccaccgagtaa
# views:
# start end width
# [1] 1 3 3 [atg]
#
# $b
# Views on a 21-letter BString subject
# subject: atgcccactgtcatcacctaa
# views:
# start end width
# [1] 1 3 3 [atg]
As a simple example, finding the number and locations of 'atg' in a single sequence:
sequence <- BString("atgatgccatgcccccatgcatgatatg")
result <- matchPattern("atg", sequence)
# Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
# start end width
# [1] 1 3 3 [atg]
# [2] 4 6 3 [atg]
# [3] 9 11 3 [atg]
# [4] 17 19 3 [atg]
# [5] 21 23 3 [atg]
# [6] 26 28 3 [atg]
# Find out how many 'atg's were found
length(result)
# [1] 6
# Get the start site of each 'atg'
start(result)  # accessor equivalent of result@ranges@start
# [1] 1 4 9 17 21 26
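Also, check out ?DNAString and ?RNAString. They are similar to BString, except that they are restricted to nucleotide characters, and they allow quick comparison between DNA and RNA sequences.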
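A minimal sketch of the same search on a DNAString, assuming the Biostrings package is installed and loaded. The sequence is the one from the example above, written in uppercase; countPattern is used here only as a shortcut for counting matches.

```r
library(Biostrings)

# DNAString only accepts letters from the DNA alphabet,
# so a stray character fails at construction time.
dna <- DNAString("ATGATGCCATGCCCCCATGCATGATATG")

hits <- matchPattern("ATG", dna)
start(hits)               # start position of each match
# [1]  1  4  9 17 21 26

countPattern("ATG", dna)  # number of matches, without building the Views
# [1] 6

# The same sequence as RNA (T becomes U)
rna <- RNAString(dna)
```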
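Edit to address the frame-shift concern raised in the comments: you can subset the result using the modulo trick mentioned by @DWin to keep only the 'atg's that are in frame.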
# assuming the first 'atg' sets the frame
in.frame.result <- result[(result@ranges@start - result@ranges@start[1]) %% 3 == 0]
# Views on a 28-letter BString subject
# subject: atgatgccatgcccccatgcatgatatg
# views:
# start end width
# [1] 1 3 3 [atg]
# [2] 4 6 3 [atg]
# There are two 'atg's in frame in this result
length(in.frame.result)
# [1] 2
# With your data:
file = read.fasta(file = "mydata.txt")
atg <- lapply(file, function(x) {
string <- BString(paste(x, collapse = ""))
result <- matchPattern("atg", string)
result[(result@ranges@start - result@ranges@start[1]) %% 3 == 0]
})
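The modulo filter can be checked on plain integers, without Biostrings. The sketch below uses a hypothetical helper, in_frame (not part of any package), applied to the start positions from the simple example; a start is kept when its distance from the frame-setting position is a multiple of 3.

```r
# Hypothetical helper: keep the start positions that share a reading
# frame with `frame_start` (by default, the first match).
in_frame <- function(starts, frame_start = starts[1]) {
  starts[(starts - frame_start) %% 3 == 0]
}

starts <- c(1, 4, 9, 17, 21, 26)

in_frame(starts)
# [1] 1 4

# Choosing a different 'atg' as the frame anchor changes the result
in_frame(starts, frame_start = 9)
# [1]  9 21
```

Note that R's %% returns a non-negative remainder for a positive modulus, so positions upstream of the anchor are handled correctly too.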