
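This is the first time I have run into this problem with read.table: for rows with a large number of columns, read.table wraps the column entries onto the following rows.

I have a .txt file whose rows have variable, unequal lengths. For reference, this is the .txt file I am reading: http://www.broadinstitute.org/gsea/msigdb/download_file.jsp?filePath=/resources/msigdb/4.0/c5.bp.v4.0.symbols.gmt

Here is my code: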

tabsep <- gsub("\\\\t", "\t", "\\t")
MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = tabsep)

Partial output: first columns

                                 V1                                                                               V2     V3     V4      V5      V6
1                   TRNA_PROCESSING                  http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING  ADAT1  TRNT1   FARS2
2  REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY   DLC1   ALS2  SLC9A7
3             DNA_METABOLIC_PROCESS            http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS  XRCC5  XRCC4  RAD51C
4     AMINO_SUGAR_METABOLIC_PROCESS    http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS   UAP1   CHIA  GNPDA1
5      BIOPOLYMER_CATABOLIC_PROCESS     http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS   BTRC HNRNPD    USE1
6             RNA_METABOLIC_PROCESS            http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP
7                             INTS6                                                                             LSM5   LSM4   LSM3    LSM1
8                               CRK                                                                                                       
9          GLUCAN_METABOLIC_PROCESS         http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS    GCK   PYGM   GSK3B
10       PROTEIN_POLYUBIQUITINATION       http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION  ERCC8  HUWE1   DZIP3
...

Partial output: last columns

     V403   V404     V405   V406    V407   V408   V409  V410  V411   V412  V413   V414   V415   V416  V417  V418  V419   V420  V421
1                                                                                                                                  
2   CALCA  CALCB  FAM107A CDK11A RASGRP4 CDK11B   SYN3 GP1BA   TNN   ENO1 PTPRC   MTL5  ISOC2   RHAG   VWF   GPI   HPX SLC5A7   F2R
3                                                                                                                                  
4                                                                                                                                  
5                                                                                                                                  
6    IRF2   IRF3 SLC2A4RG   LSM6   XRCC6  INTS1 HOXD13   RP9 INTS2 ZNF638 INTS3 ZNF254 CITED1 CITED2 INTS9 INTS8 INTS5  INTS4 INTS7
7  POU1F1 TCF7L2 TNFRSF1A  NPAS2   HAND1  HAND2 NUDT21 APEX1  ENO1    ERF  DTX1  SOX30   CBY1   DIS3   SP1   SP2   SP3    SP4  NFIC
8                                                                                                                                  
9                                                                                                                                  
10 

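For example, the column entries of row 6 wrap around to fill rows 7 and 8. I only seem to have this problem with rows that have a large number of columns. It also happens with other .txt files, but the break occurs at a different column number. I have inspected every row where the break occurs, and the entries contain no unusual characters (they are all standard uppercase gene symbols).

I have tried both read.table and read.delim, with the same result. If I first convert the .txt file to .csv and use the same code, the problem does not occur (see the equivalent output below). But I would rather not convert every file to .csv first, and I really just want to understand what is going on.

If I convert to a .csv file, the output is correct: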

MSigDB.collection = read.table(fileName, header = FALSE, fill = TRUE, as.is = TRUE, sep = ",")

                                V1                                                                               V2     V3     V4      V5      V6
1                  TRNA_PROCESSING                  http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING  ADAT1  TRNT1   FARS2  METTL1
2 REGULATION_OF_BIOLOGICAL_QUALITY http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY   DLC1   ALS2  SLC9A7   PTGS2
3            DNA_METABOLIC_PROCESS            http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS  XRCC5  XRCC4  RAD51C   XRCC3
4    AMINO_SUGAR_METABOLIC_PROCESS    http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS   UAP1   CHIA  GNPDA1     GNE
5     BIOPOLYMER_CATABOLIC_PROCESS     http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS   BTRC HNRNPD    USE1 RNASEH1
6            RNA_METABOLIC_PROCESS            http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD SYNCRIP   MED24
7         GLUCAN_METABOLIC_PROCESS         http://www.broadinstitute.org/gsea/msigdb/cards/GLUCAN_METABOLIC_PROCESS    GCK   PYGM   GSK3B   EPM2A
8       PROTEIN_POLYUBIQUITINATION       http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_POLYUBIQUITINATION  ERCC8  HUWE1   DZIP3    DDB2
9          PROTEIN_OLIGOMERIZATION          http://www.broadinstitute.org/gsea/msigdb/cards/PROTEIN_OLIGOMERIZATION   SYT1   AASS    TP63   HPRT1

1 Answer


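Elaborating on my comment...

From the help page for read.table:

The number of data columns is determined by looking at the first five lines of input (or the whole file if it has less than five lines), or from the length of col.names if it is specified and is longer. This could conceivably be wrong if fill or blank.lines.skip are true, so specify col.names if necessary (as in the 'Examples').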
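The rule quoted above can be checked directly on a small file. The following is a minimal sketch, assuming a throwaway file created with tempfile(); the variable names (n_fields, guessed, needed) are illustrative and not from the question. It compares the field count suggested by the first five lines with the count the whole file actually needs.

```r
# Build a small ragged file: the longest line (5 fields) is line 6,
# outside the first five lines that read.table inspects.
tmp <- tempfile(fileext = ".txt")
writeLines(c(rep("1,2,3,4", 5), "1,2,3,4,5"), tmp)

n_fields <- count.fields(tmp, sep = ",")  # number of fields on each line
guessed  <- max(head(n_fields, 5))        # what the first five lines suggest
needed   <- max(n_fields)                 # what the whole file requires

# When needed > guessed, read.table(fill = TRUE) wraps the extra fields
# onto additional rows unless col.names is supplied.
needed > guessed
# [1] TRUE
```

If this comparison is TRUE for a file, that file is likely to show the wrapping described in the question.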


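To work around this with an unknown dataset, use count.fields to determine the number of fields on each line of the file, and use the maximum to create the col.names for read.table to use: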

x <- max(count.fields("~/Downloads/c5.bp.v4.0.symbols.gmt", "\t"))
Names <- paste("V", sequence(x), sep = "")
y <- read.table("~/Downloads/c5.bp.v4.0.symbols.gmt", col.names=Names, sep = "\t", fill = TRUE)

Inspect the first few rows. I will leave the actual full inspection to you.

y[1:6, 1:10]
#                                 V1
# 1                  TRNA_PROCESSING
# 2 REGULATION_OF_BIOLOGICAL_QUALITY
# 3            DNA_METABOLIC_PROCESS
# 4    AMINO_SUGAR_METABOLIC_PROCESS
# 5     BIOPOLYMER_CATABOLIC_PROCESS
# 6            RNA_METABOLIC_PROCESS
#                                                                                 V2     V3     V4
# 1                  http://www.broadinstitute.org/gsea/msigdb/cards/TRNA_PROCESSING  ADAT1  TRNT1
# 2 http://www.broadinstitute.org/gsea/msigdb/cards/REGULATION_OF_BIOLOGICAL_QUALITY   DLC1   ALS2
# 3            http://www.broadinstitute.org/gsea/msigdb/cards/DNA_METABOLIC_PROCESS  XRCC5  XRCC4
# 4    http://www.broadinstitute.org/gsea/msigdb/cards/AMINO_SUGAR_METABOLIC_PROCESS   UAP1   CHIA
# 5     http://www.broadinstitute.org/gsea/msigdb/cards/BIOPOLYMER_CATABOLIC_PROCESS   BTRC HNRNPD
# 6            http://www.broadinstitute.org/gsea/msigdb/cards/RNA_METABOLIC_PROCESS HNRNPF HNRNPD
#        V5      V6         V7    V8     V9   V10
# 1   FARS2  METTL1       SARS  AARS  THG1L   SSB
# 2  SLC9A7   PTGS2      PTGS1 MPV17  SGMS1 AGTR1
# 3  RAD51C   XRCC3      XRCC2 XRCC6  ISG20 PRIM1
# 4  GNPDA1     GNE CSGALNACT1 CHST2  CHST4 CHST5
# 5    USE1 RNASEH1     RNF217 ISG20 CDKN2A  CPA2
# 6 SYNCRIP   MED24       RORB MED23   REST MED21
nrow(y)
# [1] 825
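One possible way to do a fuller check is to compare the number of rows read with the number of lines in the file, since wrapped fields show up as extra rows. This is only a sketch on a small temporary file, not on the downloaded data, and it assumes the file has no blank lines (read.table skips those by default, so the counts would differ for a different reason).

```r
# Small ragged file standing in for the real data (illustrative only).
tmp <- tempfile(fileext = ".txt")
writeLines(c(rep("1,2,3,4", 5), "1,2,3,4,5"), tmp)

n_lines <- length(readLines(tmp))

# Default behaviour: column count guessed from the first five lines.
wrapped <- read.table(tmp, header = FALSE, sep = ",", fill = TRUE)

# With col.names sized from count.fields, as in the answer.
n_col <- max(count.fields(tmp, sep = ","))
fixed <- read.table(tmp, header = FALSE, sep = ",", fill = TRUE,
                    col.names = paste0("V", seq_len(n_col)))

nrow(wrapped) == n_lines  # FALSE: 7 rows were produced from 6 lines
nrow(fixed) == n_lines    # TRUE: one row per line
```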

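For those who don't want to download another file to try this out, here is a minimal example.

Create a 6-line CSV file in which the last line has more fields than the first 5 lines, and try read.table on it: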

cat("1,2,3,4", "1,2,3,4", "1,2,3,4", "1,2,3,4", 
    "1,2,3,4", "1,2,3,4,5", file = "test1.txt", 
    sep = "\n")
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE)
#   V1 V2 V3 V4
# 1  1  2  3  4
# 2  1  2  3  4
# 3  1  2  3  4
# 4  1  2  3  4
# 5  1  2  3  4
# 6  1  2  3  4
# 7  5 NA NA NA

Note the difference when the longest line is within the first five lines of the file:

cat("1,2,3,4", "1,2,3,4,5", "1,2,3,4", "1,2,3,4", 
    "1,2,3,4", "1,2,3,4", file = "test2.txt", 
    sep = "\n")
read.table("test2.txt", header = FALSE, sep = ",", fill = TRUE)
#   V1 V2 V3 V4 V5
# 1  1  2  3  4 NA
# 2  1  2  3  4  5
# 3  1  2  3  4 NA
# 4  1  2  3  4 NA
# 5  1  2  3  4 NA
# 6  1  2  3  4 NA

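To fix this, we use count.fields, which returns a vector with the number of fields detected on each line. We take the max of that vector and pass it to the col.names argument of read.table: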

x <- count.fields("test1.txt", sep=",")
x
# [1] 4 4 4 4 4 5
read.table("test1.txt", header = FALSE, sep = ",", fill = TRUE,
           col.names = paste("V", sequence(max(x)), sep = ""))
#   V1 V2 V3 V4 V5
# 1  1  2  3  4 NA
# 2  1  2  3  4 NA
# 3  1  2  3  4 NA
# 4  1  2  3  4 NA
# 5  1  2  3  4 NA
# 6  1  2  3  4  5
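The two steps can be wrapped in a small helper so they do not have to be repeated for every file. This is a sketch only: read_ragged is a hypothetical name, not a base R function, and it assumes the default quote and comment settings suit both count.fields and read.table.

```r
# Hypothetical helper (not part of base R): read a delimited file whose
# lines have different numbers of fields, padding short lines with NA.
read_ragged <- function(file, sep = "\t") {
  # na.rm = TRUE because count.fields can return NA for lines that fall
  # inside a multi-line quoted string.
  n_col <- max(count.fields(file, sep = sep), na.rm = TRUE)
  read.table(file, header = FALSE, sep = sep, fill = TRUE,
             col.names = paste0("V", seq_len(n_col)))
}

# Usage on a temporary copy of the 6-line example above.
tmp <- tempfile(fileext = ".txt")
writeLines(c(rep("1,2,3,4", 5), "1,2,3,4,5"), tmp)
res <- read_ragged(tmp, sep = ",")
dim(res)
# [1] 6 5
```

The file is scanned twice, once by count.fields and once by read.table, which seems an acceptable cost for files of the size discussed here.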
answered 2013-09-14T03:07:41.637