68

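I want to read a text file in R line by line, using a for loop and the length of the file. The problem is that it only prints character(0). This is the code: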

fileName="up_down.txt"
con=file(fileName,open="r")
line=readLines(con) 
long=length(line)
for (i in 1:long){
    linn=readLines(con,1)
    print(linn)
}
close(con)
4

6 Answers

143

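You should take care when using readLines(...) on big files. Reading all lines into memory at once can be risky. Below is an example of how to read a file and process just one line at a time: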

processFile = function(filepath) {
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    print(line)
  }

  close(con)
}

Be aware that reading even a single line into memory carries a risk too: a big file without line breaks can also fill up your memory.
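One way to keep memory bounded even when a file has no line breaks is to read fixed-size blocks instead of lines. The following is only a minimal sketch and not part of the answer above: the names process_in_blocks, block_bytes, and handler are made up for illustration. It assumes base R's readChar on a binary connection, and because useBytes = TRUE counts bytes, a multibyte character may be split across two blocks.

```r
# Hypothetical helper: read a file in blocks of at most `block_bytes` bytes,
# so memory use stays bounded even if the file contains no line breaks.
process_in_blocks <- function(filepath, block_bytes = 1024L, handler = print) {
  con <- file(filepath, "rb")   # binary mode, as recommended for readChar
  on.exit(close(con))           # close the connection even if handler fails
  repeat {
    block <- readChar(con, nchars = block_bytes, useBytes = TRUE)
    if (length(block) == 0) {   # nothing left to read
      break
    }
    handler(block)
  }
  invisible(NULL)
}
```

Calling process_in_blocks("up_down.txt", 4096L) would print the file in pieces of at most 4096 bytes; pass a different handler to do real work on each piece.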

answered 2016-03-03T01:02:23.223
51

Just use readLines on your file:

R> res <- readLines(system.file("DESCRIPTION", package="MASS"))
R> length(res)
[1] 27
R> res
 [1] "Package: MASS"                                                                  
 [2] "Priority: recommended"                                                          
 [3] "Version: 7.3-18"                                                                
 [4] "Date: 2012-05-28"                                                               
 [5] "Revision: $Rev: 3167 $"                                                         
 [6] "Depends: R (>= 2.14.0), grDevices, graphics, stats, utils"                      
 [7] "Suggests: lattice, nlme, nnet, survival"                                        
 [8] "Authors@R: c(person(\"Brian\", \"Ripley\", role = c(\"aut\", \"cre\", \"cph\"),"
 [9] "        email = \"ripley@stats.ox.ac.uk\"), person(\"Kurt\", \"Hornik\", role"  
[10] "        = \"trl\", comment = \"partial port ca 1998\"), person(\"Albrecht\","   
[11] "        \"Gebhardt\", role = \"trl\", comment = \"partial port ca 1998\"),"     
[12] "        person(\"David\", \"Firth\", role = \"ctb\"))"                          
[13] "Description: Functions and datasets to support Venables and Ripley,"            
[14] "        'Modern Applied Statistics with S' (4th edition, 2002)."                
[15] "Title: Support Functions and Datasets for Venables and Ripley's MASS"           
[16] "License: GPL-2 | GPL-3"                                                         
[17] "URL: http://www.stats.ox.ac.uk/pub/MASS4/"                                      
[18] "LazyData: yes"                                                                  
[19] "Packaged: 2012-05-28 08:47:38 UTC; ripley"                                      
[20] "Author: Brian Ripley [aut, cre, cph], Kurt Hornik [trl] (partial port"          
[21] "        ca 1998), Albrecht Gebhardt [trl] (partial port ca 1998), David"        
[22] "        Firth [ctb]"                                                            
[23] "Maintainer: Brian Ripley <ripley@stats.ox.ac.uk>"                               
[24] "Repository: CRAN"                                                               
[25] "Date/Publication: 2012-05-28 08:53:03"                                          
[26] "Built: R 2.15.1; x86_64-pc-mingw32; 2012-06-22 14:16:09 UTC; windows"           
[27] "Archs: i386, x64"                                                               
R> 

There is an entire manual devoted to this.

answered 2012-09-27T17:13:31.777
44

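Here is the solution with a for loop. Importantly, it moves the single call to readLines out of the for loop, so that it is not incorrectly called again and again. Here it is: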

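fileName <- "up_down.txt"
conn <- file(fileName, open = "r")
linn <- readLines(conn)        # read all lines once, outside the loop
for (i in seq_along(linn)) {   # seq_along() also copes with an empty file, unlike 1:length(linn)
   print(linn[i])
}
close(conn)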
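
The reason the code in the question prints character(0) is that its first readLines(con) call consumes the whole connection, so later calls on the same connection have nothing left to return. A small self-contained check of that behaviour, using a temporary file with made-up contents in place of up_down.txt:

```r
# Stand-in for up_down.txt (contents are invented for this check)
tmp <- tempfile(fileext = ".txt")
writeLines(c("up", "down", "up"), tmp)

con <- file(tmp, open = "r")
all_lines <- readLines(con)         # reads every line; the connection is now at end of file
leftover  <- readLines(con, n = 1)  # nothing is left to read on this connection
close(con)

print(length(all_lines))  # 3
print(leftover)           # character(0)
```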
answered 2012-09-27T17:56:03.220
4

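I wrote some code to read a file line by line for my own use case, in which different lines have different data types, following the articles read-line-by-line-of-a-file-in-r and determine-number-of-linesrecords. I think it should be a better solution for big files. My R version is 3.3.2.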

con = file("pathtotargetfile", "r")
readsizeof <- 2    # number of lines to read per step while counting the lines in the file
nooflines <- 0     # running total of lines
while ((linesread <- length(readLines(con, readsizeof))) > 0)    # count the lines in small steps, which also suits big files
  nooflines <- nooflines + linesread
close(con)    # release the first connection before opening the file again

con = file("pathtotargetfile", "r")    # reopen the file, since the cursor has reached the end of the file after counting the lines
typelist = list(0,'c',0,'c',0,0,'c',0)    # a list specifying each line's data type: the first line has the same type as 0 (numeric), the second the same type as 'c' (character), and so on. This fits my file, which has 8 lines.
for (i in seq_len(nooflines)) {
  tmp <- scan(file=con, nlines=1, what=typelist[[i]], quiet=TRUE)
  print(is.vector(tmp))
  print(tmp)
}
close(con)
answered 2017-02-09T06:12:46.250
1

I suggest you check out chunked and disk.frame. Both have functions for reading CSVs chunk by chunk.

In particular, disk.frame::csv_to_disk.frame may be the function you are after?
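
For readers who only want to see the idea of chunk-by-chunk CSV reading, here is a rough base-R sketch. It does not use chunked or disk.frame and is not their API: read_csv_in_chunks, chunk_rows, and handler are hypothetical names. It assumes a non-empty file with a simple comma-separated header and no quoted fields that contain commas or line breaks; column types are also inferred separately for each chunk.

```r
# Hypothetical helper: parse a CSV file `chunk_rows` data rows at a time
# and pass each chunk (a data.frame) to `handler`.
read_csv_in_chunks <- function(path, chunk_rows = 1000L, handler) {
  con <- file(path, "r")
  on.exit(close(con))
  # Naive header parsing: assumes plain, unquoted column names
  header <- strsplit(readLines(con, n = 1), ",", fixed = TRUE)[[1]]
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0) {   # end of file
      break
    }
    chunk <- read.csv(text = lines, header = FALSE,
                      col.names = header, stringsAsFactors = FALSE)
    handler(chunk)
  }
  invisible(NULL)
}
```

For example, read_csv_in_chunks("data.csv", 10000L, function(chunk) print(nrow(chunk))) would print the size of each chunk, where "data.csv" is a placeholder file name.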

answered 2018-11-01T22:37:33.017
0
fileName = "up_down.txt"

### code to get the line count of the file
length_connection = pipe(paste("cat ", fileName, " | wc -l", sep = "")) # "cat fileName | wc -l" because that returns just the line count, and NOT the name of the file with it
long = as.numeric(trimws(readLines(con = length_connection, n = 1)))
close(length_connection) # make sure to close the connection
###

for (i in seq_len(long)){

    ### code to extract a single line at row i from the file
    linn_connection_cmd = paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ") # extracts one line from fileName at the desired line number (i)
    linn_connection = pipe(linn_connection_cmd)
    linn = readLines(con = linn_connection, n = 1)
    close(linn_connection) # make sure to close the connection
    ###
    
    # the line is now loaded into R and anything can be done with it
    print(linn)
}

By using R's pipe() command together with shell commands to extract what we want, the full file is never loaded into R; it is read in line by line instead.

paste("head -n", format(x = i, scientific = FALSE, big.mark = ""), fileName, "| tail -n 1", sep = " ")

It is this command that does all the work; it extracts one line from the desired file.

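Edit: R's default behaviour is to return i as a normal number when it is less than 100,000, but to start returning i in scientific notation when it is greater than or equal to 100,000 (1e+05). That is why format(x = i, scientific = FALSE, big.mark = "") is used in our pipe() command: it ensures that the pipe() command always receives the number in normal form, which is the only form the command understands. If the pipe() command is given a number such as 1e+05, it cannot interpret it and fails with the following error: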

head: 1e+05: invalid number of lines
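
The formatting step described in the edit can be checked on its own, without running any shell command. This is just an illustrative sketch: naive_cmd and safe_cmd are names invented here, and the file name is the one from the question.

```r
i <- 100000

# Letting paste() convert the number itself (may come out in scientific notation)
naive_cmd <- paste("head -n", i, "up_down.txt", "| tail -n 1")

# Forcing plain digits, as in the answer's pipe() command
safe_cmd <- paste("head -n", format(x = i, scientific = FALSE, big.mark = ""),
                  "up_down.txt", "| tail -n 1")

print(naive_cmd)
print(safe_cmd)   # "head -n 100000 up_down.txt | tail -n 1"
```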
answered 2021-06-23T17:51:55.413