r - 为什么在 R 中使用 plotcounts 和 ggplot 绘制时，我的一个组织样本的 DEG 是单独绘制的？

Question

我正在比较来自三种不同组织“肝脏”、“肾脏”和“大脑”的动物的三个年龄“新生儿”、“四岁”和“二十岁”的 RNA-seq 数据。我的colata如下所示。我成功地运行了 DESeq2 工具来分析差异表达的基因。但是，当我使用“plotCounts”和“ggplot2”绘制具有最小 padj 值的差异表达基因时，三个组织之一的基因被单独绘制，两个一起绘制。我无法弄清楚我哪里出错了。如果有人可以查看我的脚本，请建议我将所有样本绘制在一起。提前感谢您的宝贵时间。

pasCts <- "C:/Users/krishna/Desktop/project/featurecounts_9samples/countmatrix.Rmatrix.txt"
pasAnno <- "C:/Users/krishna/Desktop/project/featurecounts_9samples/featurecounts_test.csv"

cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="Geneid"))
coldata <- read.csv(pasAnno, row.names=1)

coldata <- coldata[,c("condition","tissue")]
coldata$tissue <- factor(coldata$tissue)
coldata$condition <- factor(coldata$condition)

OUTPUT:
> coldata
          condition    tissue
SRR306394      "NB"   "Liver"
SRR306395    "four"   "Liver"
SRR306396  "twenty"   "Liver"
SRR306397      "NB"  "Kidney"
SRR306398    "four"  "Kidney"
SRR306399  "twenty"  "Kidney"
SRR306400      "NB"   "Brain"
SRR306401    "four"   "Brain"
SRR306402  "twenty"   "Brain"

##使coldata的行和矩阵（cts）的列的顺序相同：

all(rownames(coldata) %in% colnames(cts))  


all(rownames(coldata) == colnames(cts))  ### to check if the order is even

cts <- cts[, rownames(coldata)] 
all(rownames(coldata) == colnames(cts))

矩阵的输出（cts）：

OUTPUT:
> head(cts)
                       SRR306394 SRR306395 SRR306396 SRR306397 SRR306398 SRR306399 SRR306400 SRR306401 SRR306402
ENSMUSG00000102693         0         0         0         0         0         0         0         0         0
ENSMUSG00000064842         0         0         0         0         0         0         0         0         0
ENSMUSG00000051951         1         0         0         1         0         0        62        32        22

为数据创建 Deseq2 矩阵对象：

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds

预过滤 - 在这里我们删除读取计数非常低的行。

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

设置因子

dds$tissue <- factor(dds$tissue, levels = c("liver", "kidney", "brain"))

运行差异表达分析

dds <- DESeq(dds)
res <- results(dds)
res

要获得构建结果表的系数： resultsNames(dds) OUTPUT: 1 "Intercept" "condition_..NB. vs ..four." “条件_..二十。对..四。” [4] “condition_.four. vs ..four.” “条件_.NB。与..四个。” “条件_.二十。对..四。”

是否有可能只获得一个系数“condition_..NB. vs ..four._vs..twenty”？如果是，我应该使用什么代码？

基于 resultsName(dds) 获得的系数的对数倍数变化收缩：

## FOR COEF=  AND FOUR
resLFC_20vs4 <- lfcShrink(dds, coef=3, type="apeglm")
resLFC_20vs4

## FOR COEF= NB AND FOUR
resLFC_NBvs4 <- lfcShrink(dds, coef=2, type="apeglm")
resLFC_NBvs4

要按最小 p 值排序我们的结果表：

resOrdered <- res[order(res$pvalue),]
> resOrdered
> summary(resOrdered)  ### to summarize some basic tallies using the "summary" function

了解小于 0.1 的调整后 p 值的数量

sum(res$padj < 0.1, na.rm=TRUE)OUTPUT:2472

res05 <- results(dds, alpha=0.05) ###by default alpha = 0.1 but if adjpvalue is other than 0.1 then specify
res05
summary(res05)
sum(res05$padj < 0.05, na.rm=TRUE) ### to know the number of adj p value less than 0.05

运行上述代码后，我尝试使用 ggplot2 绘制具有 min padj 值的基因：

d <- plotCounts(dds, gene=which.min(res$padj), intgroup = "condition", returnData = TRUE)  ##plotting reads with minimun padj value

> dput(d)
structure(list(count = c(23099.7389197999, 19548.8369195126, 
17799.941667842, 20473.6092655006, 18165.0693569093, 13919.6719941735, 
1008.89639856882, 581.070434144846, 576.594165656907), condition = structure(c(2L, 
1L, 3L, 2L, 1L, 3L, 5L, 4L, 6L), .Label = c(" \"four\"", " \"NB\"", 
" \"twenty\"", "\"four\"", "\"NB\"", "\"twenty\""), class = "factor")), class = "data.frame", row.names = c("SRR306394", 
"SRR306395", "SRR306396", "SRR306397", "SRR306398", "SRR306399", 
"SRR306400", "SRR306401", "SRR306402"))

library("ggplot2")
    
ggplot(d, aes(x=condition, y=count))+
      geom_point(position=position_jitter(w=0.1,h=0))+
      scale_y_log10(breaks=c(25,100,400))

但是该图显示了两个组织样本的基因一起绘制，而第三个组织分别绘制。该图可以在这里看到： custom plotting using ggplot

谁能建议我正确的代码来获得显示所有组织的所有基因的图表？

感谢您的时间。

score 1 · Accepted Answer

我发现了问题并有解决方案，但我不知道如何防止它。

问题是前三个因子水平在引号前面有一个额外的空间，而后三个因子水平没有。

" \"four\""与"\"four\""dput 输出中的比较（\"模式只是需要打印这些引号，而不是字符串指示符）。

所以要解决这个问题，你只需要用空字符替换空格：

d$condition <- factor(gsub(" ", "", as.character(d$condition)))

ggplot(d, aes(x=condition, y=count))+
  geom_point(position=position_jitter(w=0.1,h=0))+
  scale_y_log10(breaks=c(25,100,400))

r - 为什么在 R 中使用 plotcounts 和 ggplot 绘制时，我的一个组织样本的 DEG 是单独绘制的？

了解小于 0.1 的调整后 p 值的数量

1 回答 1

Related

Reference