r - R: Predicting the type of a CSV file without reading its entire contents

Question

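I need to read a CSV file and process its contents depending on the file type, i.e. whether it is comma-separated or tab-separated. I am using the code below, but it is inefficient: if the input file is comma-separated, I have to read the file twice. The code I use is: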

readFile <- function(fileName){
  portData <- read.csv(fileName,sep="\t")
  if(length(portData) == 1){
    print("comma separated file")
    executeCommaSepFile(fileName)
  }
  else{
    print("tab separated file")
    #code to process the tab separated file
  }
}
executeCommaSepFile <- function(fileName){
  csvData <- read.csv(file=fileName, colClasses=c(NA, NA,"NULL",NA,"NULL",NA,"NULL","NULL","NULL"))
  #code to process the comma separated file
}

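Is it possible to predict the file type without reading the entire contents of the file? Alternatively, if I pass portData instead of fileName, the data arrives inside executeCommaSepFile() in this format: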

   RUS1000.01.29.1999.21st.Centy.Ins.Group.TW.Z.90130N10.72096.1527.534.0.01.21.188
1           RUS1000,01/29/1999,3com Corp,COMS,88553510,358764,16861.908,0.16,47.000
2                RUS1000,01/29/1999,3m Co,MMM,88579Y10,401346,31154.482,0.29,77.625
3 RUS1000,01/29/1999,A D C Telecommunicat,ADCT,00088630,135114,5379.226,0.05,39.813
4         RUS1000,01/29/1999,Abbott Labs,ABT,00282410,1517621,70474.523,0.66,46.438

Can this be converted to the format produced by read.csv(file=fileName, colClasses=c(NA, NA,"NULL",NA,"NULL",NA,"NULL","NULL","NULL"))? That is, to this format:

   RUS1000 X01.29.1999 TW.Z  X72096
1  RUS1000  01/29/1999 COMS  358764
2  RUS1000  01/29/1999  MMM  401346
3  RUS1000  01/29/1999 ADCT  135114
4  RUS1000  01/29/1999  ABT 1517621

score 2 · Accepted Answer

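# header=FALSE and quote="" keep every line, including the first, as raw text
portData <- read.csv(fileName, sep="\t", header=FALSE, quote="", stringsAsFactors=FALSE)
if(length(portData) == 1) {
    print("comma separated file")
    # textConnection() needs a character vector, so pass the column, not the data frame
    dat <- read.csv(textConnection(portData[[1]]))
    executeCommaSepFile(dat)  # pass the data frame, not the filename
} else {  # at top level, else must follow the closing brace on the same line
    print("tab separated file")
    # note: portData was read with header=FALSE, so row 1 holds the column names
    #code to process the tab separated file
}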
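
To get the column layout the question asks for from data that is already in memory, the same colClasses vector can be applied while re-parsing the single-column text. This is only a minimal sketch: it assumes the file was first read with header=FALSE and quote="" so that every raw line survives, and the temporary file and the three sample rows (copied from the question's output) are there purely for illustration.

```r
# Illustration only: write three of the question's rows to a temporary file.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "RUS1000,01/29/1999,3com Corp,COMS,88553510,358764,16861.908,0.16,47.000",
  "RUS1000,01/29/1999,3m Co,MMM,88579Y10,401346,31154.482,0.29,77.625",
  "RUS1000,01/29/1999,A D C Telecommunicat,ADCT,00088630,135114,5379.226,0.05,39.813"
), tmp)

# Reading a comma separated file with sep="\t" yields one column of raw lines.
raw <- read.csv(tmp, sep = "\t", header = FALSE, quote = "",
                stringsAsFactors = FALSE)

# Re-parse those lines in memory; "NULL" entries drop the unwanted columns.
keep <- c(NA, NA, "NULL", NA, "NULL", NA, "NULL", "NULL", "NULL")
dat <- read.csv(text = raw[[1]], colClasses = keep)
print(dat)
```

As in the question, read.csv treats the first line as a header here, so pass header=FALSE in the second call if the file has no header row.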

score 1 · Accepted Answer

If you stick with base R, you have at least two options.

Read in only a small piece of the file (the nrows argument to read.table and friends):

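# nrows=1 reads only the header line plus one data row
portData <- read.csv(fileName,sep="\t", nrows=1)
if(length(portData) == 1) {
    print("comma separated file")
    executeCommaSepFile(fileName)
} else {  # at top level, else must follow the closing brace on the same line
    print("tab separated file")
    executeTabSepFile(fileName) # run read.table in here
}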
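
A lighter variant of the same idea is to look at the first line only and count the candidate delimiters, without parsing anything. The sketch below is a rough heuristic, not a robust detector: guessSep is a hypothetical helper name, and it assumes the first line is representative of the file and that a tab separated file does not contain more commas than tabs in that line.

```r
# Hypothetical helper: guess the separator from the first line only.
guessSep <- function(fileName) {
  firstLine <- readLines(fileName, n = 1, warn = FALSE)
  if (length(firstLine) == 0) stop("empty file: ", fileName)
  # Count literal occurrences of each candidate delimiter in the first line.
  countChar <- function(ch) {
    lengths(regmatches(firstLine, gregexpr(ch, firstLine, fixed = TRUE)))
  }
  if (countChar("\t") > countChar(",")) "\t" else ","
}

# Usage with two small temporary files (illustrative data).
csvFile <- tempfile(fileext = ".csv")
writeLines(c("a,b,c", "1,2,3"), csvFile)
tsvFile <- tempfile(fileext = ".tsv")
writeLines(c("a\tb\tc", "1\t2\t3"), tsvFile)

dat <- read.csv(csvFile, sep = guessSep(csvFile))
```

The file is then parsed exactly once, with the separator chosen up front.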

Read in the whole file and, if that didn't work, use textConnection to avoid going back to disk (not efficient, but it works):

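# header=FALSE and quote="" keep every line, including the first, as raw text
portData <- read.csv(fileName, sep="\t", header=FALSE, quote="", stringsAsFactors=FALSE)
if(length(portData) == 1) {
    print("comma separated file")
    # textConnection() needs a character vector, so pass the column, not the data frame
    dat <- read.csv(textConnection(portData[[1]]))
    executeCommaSepFile(dat)  # pass the data frame, not the filename
} else {  # at top level, else must follow the closing brace on the same line
    print("tab separated file")
    # note: portData was read with header=FALSE, so row 1 holds the column names
    #code to process the tab separated file
}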
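
If the goal is simply to touch the disk once, another possibility is to pull the raw lines in with readLines and parse them from memory with whichever separator the first line suggests. This is a sketch under assumptions, not a drop-in replacement: readDelimOnce is a hypothetical name, and the rule "a tab in the first line means tab separated" is a guess that will misfire on comma separated files whose first line happens to contain a tab.

```r
# Hypothetical helper: one disk read, then parse from memory.
readDelimOnce <- function(fileName) {
  lines <- readLines(fileName, warn = FALSE)  # the only disk read
  if (length(lines) == 0) stop("empty file: ", fileName)
  # Assumption: a tab in the first line means the file is tab separated.
  sep <- if (grepl("\t", lines[1], fixed = TRUE)) "\t" else ","
  read.csv(text = lines, sep = sep)
}

# Usage with two small temporary files (illustrative data).
csvFile <- tempfile(fileext = ".csv")
writeLines(c("id,name", "1,x", "2,y"), csvFile)
tsvFile <- tempfile(fileext = ".tsv")
writeLines(c("id\tname", "1\tx", "2\ty"), tsvFile)

fromCsv <- readDelimOnce(csvFile)
fromTsv <- readDelimOnce(tsvFile)
```

Both calls should return the same two-column data frame, whichever delimiter the file used.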
