r - How can I load a GEO methylation (450k) dataset without a sample sheet?

Question

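I downloaded some Illumina 450k methylation datasets from the Gene Expression Omnibus (GEO).

The R Bioconductor packages minfi and ChAMP seem to require a so-called "sample sheet".

Most TAR files on GEO do not seem to contain such a sample sheet; they only contain .idat files.

Could anyone offer some advice? I would like to know how to run the ChAMP / minfi pipelines without a sample sheet; failing that, is there any way to generate a sample sheet from the .idat files?

Thanks!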

score 2 · Accepted Answer

This is how I get the sample sheet and read the idats into an RGSet object:

#using pacman to install and load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load("GEOquery","minfi")

#increase file download timeout
options(timeout = 600)

#download GEO object
gse <- getGEO("GSE12345", GSEMatrix = TRUE)
#get phenotype data - sample sheet
pd = pData(gse[[1]])

#get raw data - idats, processed beta matrix, etc.
getGEOSuppFiles("GSE12345")
#decompress idats
untar("GSE12345/GSE12345_RAW.tar", exdir = "GSE12345/idat")
#list files
head(list.files("GSE12345/idat", pattern = "idat"))
idatFiles <- list.files("GSE12345/idat", pattern = "idat.gz$", full.names = TRUE)
#decompress individual idat files
sapply(idatFiles, gunzip, overwrite = TRUE)
#read idats and create RGSet
RGSet <- read.metharray.exp("GSE12345/idat")

saveRDS(RGSet, "RGSet_GSE12345.RDS")

score 2 · Accepted Answer

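I ran into a similar problem with a GEO project. What I did was download all of the .idat files and put them in a folder of their own. Then I used this code to parse the .idat file names and create the sample sheet.

It parses a file name such as GSM1855609_9020331147_R02C02_Grn.idat into its first three underscore-separated parts and prints them as comma-separated lines, which you can redirect into a .csv file. You can then read the .csv file into R, add the standardized column names that the loading function expects, c("Sample_Name", "Sentrix_ID", "Sentrix_Position"), and you are ready to go.

Hope this helps!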

#!/usr/bin/env python
# Import the OS library
import os

# Get your Current Working Directory
cwd = os.getcwd()

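# Get a list of the green-channel .idat files (plain or gzipped) in your directory.
# Every sample also has a matching _Red.idat file, so keeping only one channel
# gives one row per sample and skips non-idat files, such as this script itself.
# This will be a list of strings.
filenames = sorted(
    name for name in os.listdir(cwd)
    if name.endswith(("_Grn.idat", "_Grn.idat.gz"))
)

# Split each one into the chunks that were separated by underscores ("_") and then keep the first three for each name.
# This will be a list of lists.
chunked_names = [filename.split("_")[0:3] for filename in filenames]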

# For each name, rejoin the three chunks with commas
# We're back to having a list of strings.
csv_lines = [",".join(chunks) for chunks in chunked_names]
# Join all of those strings with the newline character to get just a long string.
contents = "\n".join(csv_lines)

# Print this string to standard output so that it can be redirected to a file.

print(contents)
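
As a variant of the script above, the header row can be written at the same time, so the file does not need its column names added afterwards in R. This is only a sketch: build_sample_sheet is a hypothetical helper name, and it assumes every file follows the four-part GSM_SentrixID_Position_Channel.idat naming shown in the example file name. Series whose file names carry extra underscore-separated fields would need different parsing.

```python
import csv
import io


def build_sample_sheet(filenames):
    """Return CSV text with one row per sample, parsed from IDAT file names.

    Hypothetical helper: assumes names look like
    GSM1855609_9020331147_R02C02_Grn.idat (optionally with a .gz suffix).
    """
    rows = set()
    for name in filenames:
        # Ignore anything that is not an IDAT file.
        if not name.endswith((".idat", ".idat.gz")):
            continue
        parts = name.split("_")
        # Expect accession, Sentrix ID, Sentrix position, and channel.
        if len(parts) < 4:
            continue
        # The set collapses the _Grn and _Red files into a single row.
        rows.add(tuple(parts[:3]))

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Sample_Name", "Sentrix_ID", "Sentrix_Position"])
    writer.writerows(sorted(rows))
    return out.getvalue()


if __name__ == "__main__":
    example = [
        "GSM1855609_9020331147_R02C02_Grn.idat",
        "GSM1855609_9020331147_R02C02_Red.idat",
    ]
    print(build_sample_sheet(example), end="")
```

Passing os.listdir() of the idat folder instead of the example list, and writing the returned text to a file, would give a sample sheet with the three column names already in place.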

score 0 · Accepted Answer

If you want to read all of the idat files in a directory, you can use:

# read.metharray.exp() replaces read.450k.exp(), which newer minfi releases no longer provide
my_450k <- read.metharray.exp(base = "path/to/directory", recursive = TRUE)

At some stage you will still need to match the phenotype data to the 450k data via the sample barcodes.
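
One way to picture that matching step is sketched below. It rests on two assumptions that may not hold for every series: the idat file names start with the GSM accession (as in GSM1855609_9020331147_R02C02_Grn.idat), and the phenotype table is keyed by that same accession. The function name match_phenotypes and the dictionary layout are hypothetical and only meant for illustration.

```python
def match_phenotypes(idat_names, phenotypes):
    """Pair each array basename with the phenotype record of its GSM accession.

    idat_names: iterable of file names such as GSM1_111_R01C01_Grn.idat
    phenotypes: dict mapping GSM accession -> phenotype record (assumed layout)
    Returns (matched, unmatched): a dict of basename -> record, and a list of
    basenames for which no phenotype record was found.
    """
    suffix = "_Grn.idat"
    matched = {}
    unmatched = []
    for name in sorted(idat_names):
        # Use the green-channel file to represent each array once.
        if not name.endswith(suffix):
            continue
        basename = name[: -len(suffix)]
        # Assumption: the first underscore-separated field is the accession.
        accession = basename.split("_")[0]
        if accession in phenotypes:
            matched[basename] = phenotypes[accession]
        else:
            unmatched.append(basename)
    return matched, unmatched


if __name__ == "__main__":
    names = [
        "GSM1_111_R01C01_Grn.idat",
        "GSM1_111_R01C01_Red.idat",
        "GSM2_111_R02C01_Grn.idat",
    ]
    pheno = {"GSM1": {"tissue": "blood"}}
    print(match_phenotypes(names, pheno))
```

Checking the unmatched list before any analysis is a cheap way to catch arrays that have no phenotype row.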

score 0 · Accepted Answer

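The newer methylprep Python package has functionality for downloading GEO datasets. It works for most series, although many of them do not have the same kinds of files in their archives.

methylprep also has a sample_sheet command-line option with a --create flag, in case you need a sample sheet to feed into minfi. Like this: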

 python -m methylprep -v sample_sheet -d ~/GSE133062/GSE133062 --create

(where -d specifies the path to the uncompressed .idat files)

More examples: https://readthedocs.com/projects/life-epigenetics-methylprep/
