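It looks like you're trying to extract the reference and alternate alleles? Matching only a single character suggests you are interested only in SNPs. You can use strsplit to build a data frame of the ref and alt alleles.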
test <- c("G>GA", "T>A", "G>A", "G>A", "A>T", "CT>C", "T>C", "T>C", "A>T", "T>C", "T>A", "A>G", "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C", "C>T", "T>A", "T>C", "T>G", "G>C", "T>G", "T>A", "G>A")
# Split each "REF>ALT" string at ">" and bind the pieces into a two-column data frame
Alleles <- data.frame(do.call(rbind, strsplit(test, split = ">", fixed = TRUE)), stringsAsFactors = FALSE)
colnames(Alleles) <- c("Ref", "Alt")
# Total number of bases across both alleles; a SNP has exactly one on each side
Alleles$bases <- nchar(Alleles$Ref) + nchar(Alleles$Alt)
SNPs <- Alleles[Alleles$bases == 2, ]
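Taking just one base from either side of the substitution (>) would give you the wrong genetic information. The variant "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C" would be reduced to "A>C": it looks like a simple SNP, but it is actually a deletion of the last 37 bases, "CGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>-".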
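If you want to keep the non-SNP variants rather than discard them, one option is to label each variant by comparing the lengths of its two alleles. This is only a rough sketch: classify_variant is a hypothetical helper, not a function from any package, the type labels are my own, and it assumes every string is a well-formed "REF>ALT" pair with non-empty alleles.

```r
# Hypothetical helper (an assumption, not part of the original answer):
# label each "REF>ALT" string by comparing allele lengths.
classify_variant <- function(v) {
  # One row per variant, column 1 = ref allele, column 2 = alt allele
  parts <- do.call(rbind, strsplit(v, split = ">", fixed = TRUE))
  ref_len <- nchar(parts[, 1])
  alt_len <- nchar(parts[, 2])
  ifelse(ref_len == 1 & alt_len == 1, "SNP",
         ifelse(ref_len > alt_len, "deletion",
                ifelse(ref_len < alt_len, "insertion", "other")))
}

variants <- c("G>GA", "T>A", "CT>C",
              "CCGCCGCGGCCGCCGTCTTCCACCAACAACATGGCGGA>C")
result <- data.frame(variant = variants,
                     type = classify_variant(variants),
                     stringsAsFactors = FALSE)
print(result)
```

Here the long variant is labelled a deletion instead of being mistaken for an "A>C" SNP. Equal-length alleles longer than one base fall into "other".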
Is this what you're after?