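I'm trying to import a CSV file into R using the read.table command. I keep getting the error message "more columns than column names", even though I have set strip.white to TRUE. The program that produces the csv files adds a large number of comma characters to the end of each line, which I think is where the extra columns come from.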

read.table("filename.csv", sep=",", fill=T, header=TRUE, strip.white = T, 
           as.is=T,row.names = NULL, quote = "")

How can I get R to strip the extraneous comma columns from the header row and from the rest of the CSV file as it reads the file in?

Also, many of the cells in the csv file contain no data. Is it possible to get R to fill these empty cells with "NA"?

The first two lines of the csv file:

Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,

2 Answers

You can use colClasses with "NULL" entries to drop the columns created by the trailing commas (this also needs fill=TRUE):

read.table(text="1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,
 9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,", sep=",", fill=TRUE, colClasses=c(rep("numeric", 8), rep("NULL", 30)) )
#------------------
  V1 V2 V3 V4 V5 V6 V7 V8
1  1  2  3  4  5  6  7  8
2  9  9  9  9  9  9  9  9
Warning message:
In read.table(text = "1,2,3,4,5,6,7,8,,,,,,,,,,,,,,,,,,\n9,9,9,9,9,9,9,9,,,,,,,,,,,,,,,,,",  :
  cols = 26 != length(data) = 38

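I needed to add the missing linefeed at the end of the first line. (Another reason you should edit the question rather than putting data examples in comments.) There is an octothorpe (#) in the header, which requires setting comment.char to "":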

read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\nChr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",")

  Document_Name Sequence_Name Track_Name Type           Name
1       Chr2_FT          Chr2   Chr2.bed  CDS 10000_ARHGAP15
                                                        Sequence  Minimum Min_.with_gaps...  Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421          56019336 55916483
  Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1        56019399     63                 64            1   forward                                     U‌​ser
  Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1                                                                                                    
  Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1                                                      
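The counts 24 and 41 in the call above were entered by hand. As a minimal sketch, they could instead be derived from the header line itself. The data in `csv_text` is a small made-up stand-in, not the asker's file, and the variable names are hypothetical; the sketch also assumes every data row has the same number of fields as the header.

```r
# Made-up sample: three named columns followed by three empty trailing ones.
csv_text <- "id,name,#_hits,,,\n1,a,,,,\n2,b,7,,,\n"

# Read only the header line. comment.char = "" keeps the '#' inside a name.
header <- scan(text = csv_text, what = "", sep = ",", quote = "",
               nlines = 1, comment.char = "", quiet = TRUE)

# Position of the last non-empty column name; everything after it is padding.
n_named <- max(which(nzchar(header)))

# NA lets read.table guess the type; "NULL" tells it to skip the column.
col_classes <- c(rep(NA_character_, n_named),
                 rep("NULL", length(header) - n_named))

cleaned <- read.table(text = csv_text, header = TRUE, sep = ",", quote = "",
                      comment.char = "", na.strings = "",
                      colClasses = col_classes)
print(cleaned)
```

For a file on disk, the same idea should apply with `file = "filename.csv"` in place of `text = csv_text` in both calls.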

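If you know what your colClasses are going to be, then missing values in the numeric columns will be set to NA automatically. You can also use the na.strings setting to accomplish this. You could also edit the header to remove the illegal characters from the column names. (I didn't think I needed to be the one to do that, though.)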

read.table(text="Document_Name,Sequence_Name,Track_Name,Type,Name,Sequence,Minimum,Min_(with_gaps‌​),Maximum,Max_(with_gaps),Length,Length_(with_gaps),#_Intervals,Direction,Average‌​_Quality,Coverage,modified_by,Polymorphism_Type,Strand-Bias,Strand-Bias_>50%_P-va‌​lue,Strand-Bias_>65%_P-value,Variant_Frequency,Variant_Nucleotide(s),Variant_P-Va‌​lue_(approximate),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
 Chr2_FT,Chr2,Chr2.bed,CDS,10000_ARHGAP15,GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG,55916421,56019336,55916483,56019399,63,64,1,forward,,,U‌​ser,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,", header=TRUE, colClasses=c(rep("character", 24), rep("NULL", 41)), comment.char="", sep=",", na.strings="")
#------------------------------------------------------
  Document_Name Sequence_Name Track_Name Type           Name
1       Chr2_FT          Chr2   Chr2.bed  CDS 10000_ARHGAP15
                                                        Sequence  Minimum Min_.with_gaps...  Maximum
1 GAAAGAATCATTAACAGTTAGAAGTTGATG-AAGTTTCA‌​ATAACAAGTGGGCACTGAGAGAAAG 55916421          56019336 55916483
  Max_.with_gaps. Length Length_.with_gaps. X._Intervals Direction Average.._Quality Coverage modified_by
1        56019399     63                 64            1   forward              <NA>     <NA>          U‌​ser
  Polymorphism_Type Strand.Bias Strand.Bias_.50._P.va..lue Strand.Bias_.65._P.value Variant_Frequency
1              <NA>        <NA>                       <NA>                     <NA>              <NA>
  Variant_Nucleotide.s. Variant_P.Va..lue_.approximate.
1                  <NA>                            <NA>
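The header clean-up mentioned above could look roughly like the following. This is only a sketch of one possible naming convention (runs of non-alphanumeric characters collapsed to a single underscore), not something the answer prescribes; `raw_names` holds three names copied from the question's header and `clean_names` is a hypothetical variable.

```r
# Three column names taken from the question's header line.
raw_names <- c("Min_(with_gaps)", "#_Intervals", "Strand-Bias_>50%_P-value")

# Replace each run of characters that are not letters or digits with "_".
clean_names <- gsub("[^A-Za-z0-9]+", "_", raw_names)

# Trim any underscore left at the start or end of a name.
clean_names <- gsub("^_+|_+$", "", clean_names)

# Final safety net: make the names syntactically valid and unique.
clean_names <- make.names(clean_names, unique = TRUE)
print(clean_names)
```

The cleaned vector can then be assigned with `names(df) <- clean_names` after reading the file with `check.names = FALSE`, where `df` stands for whatever data frame was read.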
answered 2012-12-21T20:32:31.173

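I've been fiddling with the first two lines of your file, and the problem appears to be the # in one of your column names. By default read.table treats # as a comment character, so it reads your header, ignores everything after the #, and returns 13 columns.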

You will be able to read the file with read.table by passing the argument comment.char=""
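A quick way to see the effect is to count the fields per line with and without the comment character. The sketch below uses a made-up two-line sample rather than the asker's file, and `fields_per_line` is a hypothetical helper written for this illustration.

```r
# Made-up sample with a '#' at the start of the third column name.
csv_text <- "a,b,#_c,d\n1,2,3,4\n"

# Count comma-separated fields on each line for a given comment character.
fields_per_line <- function(txt, comment.char) {
  con <- textConnection(txt)
  on.exit(close(con))
  count.fields(con, sep = ",", quote = "", comment.char = comment.char)
}

with_default <- fields_per_line(csv_text, comment.char = "#")  # read.table default
with_fix     <- fields_per_line(csv_text, comment.char = "")   # comments disabled

print(with_default)  # the header line comes up short
print(with_fix)      # header and data line agree
```

When the header has fewer fields than the data rows, read.table cannot match names to columns, which is consistent with the "more columns than column names" error in the question.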

By the way, this is another reason why askers should include a sample of the file or dataset they are working with.

answered 2012-12-22T03:20:23.867