r - Checking a very large tab-delimited file for unique values

Question

I have a very large report file, and I want to check a specific column named Sample_id for unique values. My data looks like this:

 [1] "[Header]"                                                                               
 [2] "GSGT Version\t1.9.4"                                                                     
 [3] "Processing Date\t7/6/2012 11:41 AM"                                                      
 [4] "Content\t\tGS0005701-OPA.opa\tGS0005702-OPA.opa\tGS0005703-OPA.opa\tGS0005704-OPA.opa"       
 [5] "Num SNPs\t5858"                                                                          
 [6] "Total SNPs\t5858"                                                                        
 [7] "Num Samples\t132"                                                                        
 [8] "Total Samples\t132"                                                                      
 [9] "[Data]"                                                                                 
[10] "SNP Name\tSample ID\tGC Score\tAllele1 - AB\tAllele2 - AB\tChr\tPosition\tGT Score\tX Raw\tY Raw"
[11] "rs1867749\t106N\t0.8333\tB\tB\t2\t120109057\t0.8333\t301\t378"                                   
[12] "rs1397354\t106N\t0.6461\tA\tB\t2\t215118936\t0.6461\t341\t192"                                   
[13] "rs2840531\t106N\t0.5922\tB\tB\t1\t2155821\t0.6091\t296\t391"                                     
[14] "rs649593\t106N\t0.8709\tA\tB\t1\t37635225\t0.8709\t357\t200"                                     
[15] "rs1517342\t106N\t0.4839\tA\tB\t2\t169218217\t0.4839\t316\t210"                                   
[16] "rs1517343\t106N\t0.5980\tA\tB\t2\t169218519\t0.5980\t312\t165"                                   
[17] "rs1868071\t106N\t0.5518\tA\tB\t2\t30219358\t0.5518\t355\t229"                                    
[18] "rs761162\t106N\t0.6923\tA\tB\t1\t13733834\t0.6923\t315\t257"                                     
[19] "rs911903\t106N\t0.6053\tA\tA\t1\t46982589\t0.6096\t383\t158"                                     
[20] "rs753646\t106N\t0.6676\tA\tB\t1\t208765509\t0.6688\t341\t169"

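So my question is how to check the Sample_ID column for unique values using R. I already know a little about unique, but how do I get the right column out of a tab-delimited file?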

score 3 · Accepted Answer

First, read the file:

sample_data <- read.table(file = "filename", sep = "\t", skip = 9, header = TRUE)

Then run the following (spaces in column names are automatically converted to dots):

unique(sample_data[, "Sample.ID"])
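
For a very large file, one option is to avoid parsing the columns that are not needed. The sketch below is an addition to the answer, not part of it: it assumes the layout shown in the question (9 lines before the column header, Sample ID in the second column) and passes colClasses so that only that column is kept. The temporary file tmp, the three-column layout, and the sample value 107N are hypothetical stand-ins for the real report.

```r
# Hypothetical stand-in for the real report: 9 header lines, then the table.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "[Header]",
  "GSGT Version\t1.9.4",
  "Processing Date\t7/6/2012 11:41 AM",
  "Content\t\tGS0005701-OPA.opa",
  "Num SNPs\t5858",
  "Total SNPs\t5858",
  "Num Samples\t132",
  "Total Samples\t132",
  "[Data]",
  "SNP Name\tSample ID\tGC Score",
  "rs1867749\t106N\t0.8333",
  "rs1397354\t106N\t0.6461",
  "rs2840531\t107N\t0.5922"   # 107N is made up for illustration
), tmp)

# "NULL" in colClasses drops a column, so only column 2 (Sample ID) is parsed.
# The real file has 10 columns, so it would need one entry per column.
sample_ids <- read.table(
  file = tmp, sep = "\t", skip = 9, header = TRUE,
  colClasses = c("NULL", "character", "NULL")
)

unique_ids <- unique(sample_ids[, "Sample.ID"])
print(unique_ids)
```

With the 10-column file from the question, the colClasses vector would presumably be c("NULL", "character", rep("NULL", 8)).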

score 1 · Accepted Answer

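If the data is already in an R object, say one named "Lines", then you need to apply the sensible solution offered by cafe876 through a textConnection call, or use the text= argument that was added to R in recent versions: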

samp_dat <-  read.table(file = textConnection(Lines), sep = "\t", skip = 9, header=TRUE)

Or:

samp_dat <- read.table(text= Lines, sep = "\t", skip = 9, header=TRUE)

Here is a test case:

Lines <-
c("[Header]                                                                               ", 
"GSGT Version\t1.9.4                                                                     ", 
"Processing Date\t7/6/2012 11:41 AM                                                      ", 
"Content\t\tGS0005701-OPA.opa\tGS0005702-OPA.opa\tGS0005703-OPA.opa\tGS0005704-OPA.opa       ", 
"Num SNPs\t5858                                                                          ", 
"Total SNPs\t5858                                                                        ", 
"Num Samples\t132                                                                        ", 
"Total Samples\t132                                                                      ", 
"[Data]                                                                                 ", 
"SNP Name\tSample ID\tGC Score\tAllele1 - AB\tAllele2 - AB\tChr\tPosition\tGT Score\tX Raw\tY Raw", 
"rs1867749\t106N\t0.8333\tB\tB\t2\t120109057\t0.8333\t301\t378                                   ", 
"rs1397354\t106N\t0.6461\tA\tB\t2\t215118936\t0.6461\t341\t192                                   ", 
"rs2840531\t106N\t0.5922\tB\tB\t1\t2155821\t0.6091\t296\t391                                     ", 
"rs649593\t106N\t0.8709\tA\tB\t1\t37635225\t0.8709\t357\t200"
)
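
As a rough illustration of how the test case above can be used, the sketch below reads a shortened copy of it and extracts the unique sample IDs. The names demo_lines, data_marker, demo_dat, and demo_ids are hypothetical, and locating the "[Data]" line with grep instead of hard-coding skip = 9 is an assumption added here, not something either answer proposes.

```r
# Shortened copy of the test case (fewer columns and rows, no padding).
demo_lines <- c(
  "[Header]",
  "GSGT Version\t1.9.4",
  "Processing Date\t7/6/2012 11:41 AM",
  "Content\t\tGS0005701-OPA.opa\tGS0005702-OPA.opa",
  "Num SNPs\t5858",
  "Total SNPs\t5858",
  "Num Samples\t132",
  "Total Samples\t132",
  "[Data]",
  "SNP Name\tSample ID\tGC Score\tAllele1 - AB\tAllele2 - AB",
  "rs1867749\t106N\t0.8333\tB\tB",
  "rs1397354\t106N\t0.6461\tA\tB"
)

# Position of the "[Data]" marker; everything up to and including it is skipped.
data_marker <- grep("^\\[Data\\]", demo_lines)[1]

demo_dat <- read.table(text = demo_lines, sep = "\t",
                       skip = data_marker, header = TRUE)

# as.character() keeps the result a plain character vector even on R versions
# before 4.0.0, where strings were read as factors by default.
demo_ids <- unique(as.character(demo_dat[, "Sample.ID"]))
print(demo_ids)
```

For a file on disk, the same marker search could be run on readLines(path, n = 50) or a similar small prefix, so that the whole file does not have to be read twice.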
