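I have 100 FASTA files, and I would like to plot overlapping histograms of their genetic distance matrices to see how much the bootstrap replicates of my DNA data overlap.
I have figured out how to get ape to read each file using the following: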
files <- list.files("/Volumes/ALEX_R-HD/", pattern="xii-27-D")  # pattern is a regex matched against file names, not full paths
library("ape")
library("pegas")
library("plyr")
library("dostats")
filenames <- dir(path="/Volumes/ALEX_R-HD/xii-27_D_coccus", full.names=TRUE, pattern="xii-27")
listOfiles <- lapply(filenames, function(x) read.dna(x, format="fasta"))
Then I generate a genetic distance matrix for each one using:
distOfiles <- lapply(listOfiles, function(y) dist.dna(y, model="TN93"))
When I print them at the R console, the genetic distance objects look like this:
[[1]]
M_51_1_new__ M_51_3_new__ M_51_4_new2__ M_51_5_new2__ M_51_6_new__ M_51_7_new__ M_51_8_new__ madera_1_new__ madera_2_new__ madera_3__ madera_4_new__ madera_5_new__
M_51_3_new__ 0.000000000
M_51_4_new2__ 0.000000000 0.000000000
M_51_5_new2__ 0.000000000 0.000000000 0.000000000
M_51_6_new__ 0.000000000 0.000000000 0.000000000 0.000000000
M_51_7_new__ 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
M_51_8_new__ 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000 0.000000000
madera_1_new__ 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343
madera_2_new__ 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.000000000
madera_3__ 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.037124343 0.000000000
and goes on to.... [[100]]
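Where I am running into trouble is plotting the histograms so that every bootstrap replicate is drawn on top of the others in the same window. The script below just draws each one in a brand-new plot and does not overlay them: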
bins=seq(0,0.05,by=0.001)
HistOfiles <- lapply(distOfiles, function(z) hist(z, breaks=bins, main="Histogram of D. coccus Mexico-types TN93 distances", ylim=c(0,1500), xlab="TN93 distance", ylab="frequency", col=rgb(0,0,0,0.01), border=rgb(0,0,0,0.01)))
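As far as I understand it, each call to hist() starts a fresh plot unless it is told otherwise, which would explain the behaviour above. A minimal sketch of one possible workaround follows: compute the histograms with plot=FALSE, then draw the first and add the others with add=TRUE. The list fake_dists is made-up stand-in data (plain dist() on random numbers), not real dist.dna output, and the 0.2 transparency is an arbitrary choice for the illustration.

```r
# Stand-in data: three small 'dist' objects playing the role of distOfiles.
# (Hypothetical values; the real list would come from dist.dna.)
set.seed(1)
fake_dists <- lapply(1:3, function(i) dist(matrix(runif(24, 0, 0.03), ncol = 2)))

bins <- seq(0, 0.05, by = 0.001)

# Compute every histogram without drawing it, so a shared y-limit can be found.
hists <- lapply(fake_dists, function(d) hist(as.vector(d), breaks = bins, plot = FALSE))
ymax <- max(sapply(hists, function(h) max(h$counts)))

# Draw the first histogram, then add the remaining ones to the same plot.
fill <- rgb(0, 0, 0, 0.2)
plot(hists[[1]], col = fill, border = fill, ylim = c(0, ymax),
     main = "TN93 distances, all replicates",
     xlab = "TN93 distance", ylab = "frequency")
for (h in hists[-1]) plot(h, col = fill, border = fill, add = TRUE)
```

With the real data, distOfiles would take the place of fake_dists; the same loop should then cope with 100 or 1,000 replicates without any per-file code.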
I know the overlay can be done by hand in the following way:
bins=seq(0,0.05,by=0.001)
readfile1 <- read.dna("/Volumes/ALEX_R-HD/xii-27_D_coccus/xii-27-D_coccus1", format="fasta")
distance_TN931 <- dist.dna(readfile1, model="TN93")
bins=seq(0,0.05,by=0.001)
hist(distance_TN931, breaks=bins, main="Histogram of D. coccus Mexico-types TN93 distances", ylim=c(0,1500), xlab="TN93 distance", ylab="frequency", col=rgb(0,0,0,0.01), border=rgb(0,0,0,0.01))
lines(density(distance_TN931), col=rgb(1,0,0,0.01))
par(new=TRUE)
readfile2 <- read.dna("/Volumes/ALEX_R-HD/xii-27_D_coccus/xii-27-D_coccus2", format="fasta")
distance_TN932 <- dist.dna(readfile2, model="TN93")
bins=seq(0,0.05,by=0.001)
hist(distance_TN932, breaks=bins, ylim=c(0,1500), main="", xlab="", ylab="", col=rgb(0,0,0,0.01), border=rgb(0,0,0,0.01))
lines(density(distance_TN932), col=rgb(1,0,0,0.01))
par(new=TRUE)
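....... and so on until the last file
But I think that would be a lot of work. It is manageable for 100 files, but it could be too much for someone with 1,000 files (for example, people working with GenBank data).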
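To avoid repeating the block for every file, the density curves at least could probably be drawn in a loop on one set of axes. The sketch below assumes that approach; fake_dists is again invented stand-in data rather than real dist.dna output, and the from/to range simply mirrors the 0 to 0.05 range of my bins.

```r
# Stand-in data: five numeric vectors of pairwise distances.
# (Hypothetical values; as.vector(dist.dna(...)) would be used for real data.)
set.seed(2)
fake_dists <- lapply(1:5, function(i) as.vector(dist(matrix(runif(24, 0, 0.03), ncol = 2))))

# One kernel density estimate per replicate, all on the same x range.
dens <- lapply(fake_dists, density, from = 0, to = 0.05)
ymax <- max(sapply(dens, function(d) max(d$y)))

# Set up empty axes once, then overlay one semi-transparent curve per replicate.
plot(c(0, 0.05), c(0, ymax), type = "n",
     main = "TN93 distance densities, all replicates",
     xlab = "TN93 distance", ylab = "density")
for (d in dens) lines(d, col = rgb(1, 0, 0, 0.3))
```

Because everything is drawn on one set of axes, par(new=TRUE) should not be needed, and the axes cannot drift out of alignment between replicates.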
I also tried to get around the problem by using some Unix to paste the different files together into tab-separated columns:
paste /Volumes/ALEX_R-HD/xii-27_D_coccus/xii-27-D_coccus* /Volumes/ALEX_R-HD/xii-27_D_coccus/blank > /Volumes/ALEX_R-HD/xii-27_D_coccus/blank
The resulting file looks something like this (I have written "\t" to make explicit how the columns are separated):
>name1 "\t" >name1 "\t" >name1 ...... 100 times for each row
actgactg "\t" actgaca "\t" actgaca
actgttgc "\t" actgact "\t" actgaca
>name2 "\t" >name2 "\t" >name2
actgactg "\t" actgaca "\t" actgaca
actgttgc "\t" actgact "\t" actgaca
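But I cannot figure out how to get read.dna to read each column as a separate data matrix. I can get read.table to read the file, but I am stuck there.
At this point I am completely stumped. I am a new R user and have done a lot of searching online for a solution, but everything I have found involves some variation of the hard way described above. Maybe lattice can do the job?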
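On the lattice idea, my rough guess is that it would look something like the sketch below: stack all distances into one long data frame with a replicate label, then let densityplot() overlay the groups. This is untested against my real data; fake_dists and the column names distance and rep_id are made up for the illustration.

```r
library(lattice)

# Stand-in data: four numeric vectors of pairwise distances.
# (Hypothetical values; as.vector(dist.dna(...)) would be used for real data.)
set.seed(3)
fake_dists <- lapply(1:4, function(i) as.vector(dist(matrix(runif(24, 0, 0.03), ncol = 2))))

# Long format: one row per pairwise distance, labelled by replicate.
long <- data.frame(
  distance = unlist(fake_dists),
  rep_id   = factor(rep(seq_along(fake_dists), sapply(fake_dists, length)))
)

# One overlaid density curve per replicate; plot.points = FALSE hides the rug of points.
p <- densityplot(~ distance, data = long, groups = rep_id,
                 plot.points = FALSE, xlab = "TN93 distance")
print(p)
```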