r - CSV to Vowpal input format - optimizing slow R code

Question

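I'm trying to build a fast CSV to Vowpal input format converter. I found some good code related to libsvm and based my example on it. It works well with the supplied small Titanic data set, but my real data set has over 4.5 million observations with more than 200 features. With the supplied code, it would take three days on a powerful server.

Is there a way to remove the single loop here? Keep in mind that Vowpal has its own sparsity, so the code needs to check the indexes every time to exclude 0 or NA for each row. (Unlike a data frame, vowpal doesn't need to keep the same number of features in each row.) I'm fine with writing each row to a file instead of holding it all in memory. Any solutions would be much appreciated!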

# sample data set
titanicDF <- read.csv('http://math.ucdenver.edu/RTutorial/titanic.txt',sep='\t')
titanicDF  <- titanicDF  [c("PClass", "Age", "Sex", "Survived")]

# target variable
y <- titanicDF$Survived
lineHolders <- c()
for ( i in 1:nrow( titanicDF  )) {

    # find indexes of nonzero values - anything 
    # with zero for that row needs to be ignored!
    indexes = which( as.logical( titanicDF [i,] ))
    indexes <- names(titanicDF [indexes])

    # nonzero values
    values = titanicDF [i, indexes]

    valuePairs = paste( indexes, values, sep = ":", collapse = " " )

    # add label in the front and newline at the end
    output_line = paste0(y[i], " |f ", valuePairs, "\n")

    lineHolders <- c(lineHolders, output_line)
}

score 1 · Accepted Answer

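Addressing the original question about looping over rows: up to a point, it seems to be faster to process the data by data frame columns rather than by rows. I've put your code in a function named func_Row, as shown below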

func_Row  <-  function(titanicDF) {
# target variable
y <- titanicDF$Survived
lineHolders <- c()
for ( i in 1:nrow( titanicDF  )) {
# find indexes of nonzero values - anything 
# with zero for that row needs to be ignored!
 indexes = which( as.logical( titanicDF [i,] ))
 indexes <- names(titanicDF [indexes])
# nonzero values
 values = titanicDF [i, indexes]
 valuePairs = paste( indexes, values, sep = ":", collapse = " " )
# add label in the front and newline at the end
 output_line = paste0(y[i], " |f ", valuePairs, "\n")
 lineHolders <- c(lineHolders, output_line)
} 
return(lineHolders)
}

and put together another function which processes the data by columns

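 func_Col <- function(titanicDF) {
 # start every line with the label and the namespace marker
 lineHolders <- paste(titanicDF$Survived, "|f")
 for( ic in seq_len(ncol(titanicDF))) {
   # rows where this column is neither zero nor NA
   # (as.numeric() gives integer level codes for factors; non-numeric strings become NA)
   nonzeroes <- which(as.logical(as.numeric(titanicDF[,ic]))) 
   # append " name:value" to those rows only
   lineHolders[nonzeroes] <- paste(lineHolders[nonzeroes]," ",names(titanicDF)[ic], ":", as.numeric(titanicDF[nonzeroes,ic]),sep="") 
 }
 lineHolders <- paste(lineHolders,"\n",sep="")
 return(lineHolders)
 }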
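
To make the column-wise idea concrete, here is a minimal sketch on a three-row toy data frame. The helper name to_vw_lines, its label and namespace arguments, and the toy data are assumptions made for illustration and are not part of the code above. Unlike func_Col, this sketch leaves the label column out of the feature list, and it assumes every feature column is numeric.

```r
# Hypothetical helper: column-wise conversion that skips the label column.
# Assumes all feature columns are numeric.
to_vw_lines <- function(df, label = "Survived", namespace = "f") {
  # one output line per row, starting with "<label> |<namespace>"
  lines <- paste0(df[[label]], " |", namespace)
  for (col in setdiff(names(df), label)) {
    x <- df[[col]]
    # rows where this feature is present (neither NA nor zero)
    keep <- which(!is.na(x) & x != 0)
    lines[keep] <- paste0(lines[keep], " ", col, ":", x[keep])
  }
  lines
}

toy <- data.frame(Survived = c(1, 0, 1),
                  Age      = c(29, 0, NA),
                  Fare     = c(0, 7.25, 8))
vw <- to_vw_lines(toy)
writeLines(vw)
# 1 |f Age:29
# 0 |f Fare:7.25
# 1 |f Fare:8
```

Each pass through the loop touches a whole column at once, so the number of R-level iterations equals the number of features rather than the number of rows.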

Comparing the two functions using microbenchmark gives the following results

library(microbenchmark)
microbenchmark( func_Row(titanicDF), func_Col(titanicDF), times=10)
Unit: milliseconds
               expr        min         lq     median         uq       max neval
func_Row(titanicDF) 370.396605 375.210624 377.044896 385.097586 443.14042    10
func_Col(titanicDF)   6.626192   6.661266   6.675667   6.798711  10.31897    10

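Note that the results for this set of data are in milliseconds. Processing by columns is therefore about 50 times faster than processing by rows (median 377 ms versus 6.7 ms). It's fairly straightforward to address memory problems and retain the benefits of processing by columns by reading the data in blocks of rows. I've created a file of about 5,300,000 rows based upon the Titanic data (the original data doubled 12 times) as follows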

titanicDF_big <- titanicDF
for( i in 1:12 )  titanicDF_big <- rbind(titanicDF_big, titanicDF_big)
write.table(titanicDF_big, "titanic_big.txt", row.names=FALSE )

This file can then be read in blocks of rows using the following function

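read_blocks <- function(file_name, row_max = 6000000L, row_block = 5000L ) {
#   Version of code using func_Col to process data by columns
#   row_max is an upper bound on the number of data rows in the file
blockDF = NULL
for( row_num in seq(1, row_max, row_block)) { 
  if( is.null(blockDF) )  {
    blockDF <- read.table(file_name, header=TRUE, nrows=row_block)
    lineHolders <- func_Col(blockDF)
  }  
  else  {
    # skip the header line plus the row_num - 1 data rows already read
    blockDF <- read.table(file_name, header=FALSE, col.names=names(blockDF),
                            nrows=row_block, skip = row_num)
    if( nrow(blockDF) == 0 )  break   # past the end of the file
    lineHolders <- c(lineHolders, func_Col(blockDF))
  }
  # a short block means the end of the file has been reached
  if( nrow(blockDF) < row_block )  break
}
return(lineHolders)
}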

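Benchmark results for this version of read_blocks, which uses func_Col to process the data by columns, are given below for reading the entire expanded Titanic data file with block sizes from 500,000 rows to 2,000,000 rows: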

Unit: seconds
                                                                     expr      min       lq   median       uq      max neval
 read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 2000000L) 39.43244 39.43244 39.43244 39.43244 39.43244     1
 read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 1000000L) 46.66375 46.66375 46.66375 46.66375 46.66375     1
  read_blocks("titanic_big.txt", row_max = 6000000L, row_block = 500000L) 62.51387 62.51387 62.51387 62.51387 62.51387     1

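Larger block sizes give noticeably better times but require more memory. However, these results show that, by processing the data by columns, the entire 5.3 million row expanded Titanic data file can be read in about a minute or less, even with a block size of about 10% of the file size. Again, results will depend on the number of data columns and on system properties.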
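
Since the question mentions that writing to a file instead of holding everything in memory would be acceptable, the block-wise idea can also be sketched with open file connections, so that each block is written out as soon as it is converted and the input is not rescanned from the top for every block. This is only a sketch under stated assumptions: the names to_vw_lines and convert_stream, their arguments, and the toy data are hypothetical, all feature columns are assumed to be numeric, and the label column is left out of the features.

```r
# Hypothetical helper: column-wise conversion, label column excluded.
# Assumes all feature columns are numeric.
to_vw_lines <- function(df, label) {
  lines <- paste0(df[[label]], " |f")
  for (col in setdiff(names(df), label)) {
    x <- df[[col]]
    keep <- which(!is.na(x) & x != 0)   # skip zeros and NAs
    lines[keep] <- paste0(lines[keep], " ", col, ":", x[keep])
  }
  lines
}

# Hypothetical driver: read blocks from an open connection and
# write each converted block straight to the output file.
convert_stream <- function(in_file, out_file, label, row_block = 5000L) {
  con_in  <- file(in_file, open = "r")
  con_out <- file(out_file, open = "w")
  on.exit({ close(con_in); close(con_out) })

  # the first block carries the header line
  block <- read.table(con_in, header = TRUE, nrows = row_block)
  col_names <- names(block)
  while (!is.null(block) && nrow(block) > 0) {
    writeLines(to_vw_lines(block, label), con_out)
    if (nrow(block) < row_block) break    # short block: end of file
    # the open connection continues where the previous read stopped
    block <- tryCatch(
      read.table(con_in, header = FALSE, col.names = col_names,
                 nrows = row_block),
      error = function(e) NULL)           # nothing left to read
  }
  invisible(out_file)
}

# Small round trip on temporary files
toy <- data.frame(Survived = c(1, 0, 1, 0, 1),
                  Age      = c(29, 0, NA, 41, 2),
                  Fare     = c(0, 7.25, 8, 0, 13))
in_file  <- tempfile(fileext = ".txt")
out_file <- tempfile(fileext = ".vw")
write.table(toy, in_file, row.names = FALSE)
convert_stream(in_file, out_file, label = "Survived", row_block = 2L)
result <- readLines(out_file)
```

With this layout, memory use is bounded by one block of rows plus its converted lines, so the block size can be chosen for speed without accumulating the whole output in a single character vector.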
