我如何读/写libsvm
数据到/从R
?
libsvm
格式是稀疏数据,如
<class/target>[ <attribute number>:<attribute value>]*
(参见压缩行存储(CRS))例如,
1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:03
我确信我可以自己鞭打一些东西,但我更愿意使用现成的东西。但是,R
库foreign
似乎没有提供必要的功能。
我如何读/写libsvm
数据到/从R
?
libsvm
格式是稀疏数据,如
<class/target>[ <attribute number>:<attribute value>]*
(参见压缩行存储(CRS))例如,
1 10:3.4 123:0.5 34567:0.231
0.2 22:1 456:03
我确信我可以自己鞭打一些东西,但我更愿意使用现成的东西。但是,R
库foreign
似乎没有提供必要的功能。
e1071
已下架:install.packages("e1071")
library(e1071)
read.matrix.csr(...)
write.matrix.csr(...)
注意:它是在 中实现的R
,而不是在 中实现的C
,所以它是dog-slow。
它甚至有一个特殊的小插图支持向量机——包 e1071 中的 libsvm 接口。
r.vw
与vowpal_wabbit
注意:它是在 中实现的R
,而不是在 中实现的C
,所以它是dog-slow。
我已经使用 zygmuntz 解决方案在具有 25k 观察(行)的数据集上运行了将近 5 个小时的工作。它已经完成了 3k 行。在此期间我花了很长时间编写了这个代码(基于 zygmuntz 的代码):
require(Matrix)
read.libsvm = function( filename ) {
content = readLines( filename )
num_lines = length( content )
tomakemat = cbind(1:num_lines, -1, substr(content,1,1))
# loop over lines
makemat = rbind(tomakemat,
do.call(rbind,
lapply(1:num_lines, function(i){
# split by spaces, remove lines
line = as.vector( strsplit( content[i], ' ' )[[1]])
cbind(i, t(simplify2array(strsplit(line[-1],
':'))))
})))
class(makemat) = "numeric"
#browser()
yx = sparseMatrix(i = makemat[,1],
j = makemat[,2]+2,
x = makemat[,3])
return( yx )
}
这在同一台机器上运行了几分钟(zygmuntz 解决方案也可能存在内存问题,不确定)。希望这可以帮助任何有同样问题的人。
请记住,如果您需要在 R 中进行大量计算,请使用 VECTORIZE!
编辑:修复了我今天早上发现的索引错误。
我利用一些实用程序提出了自己的临时解决方案,data.table
它几乎很快就在我找到的测试数据集(波士顿住房数据)上运行。
将其转换为data.table
(与解决方案正交,但在此处添加以便于重现):
library(data.table)
x = fread("/media/data_drive/housing.data.fw",
sep = "\n", header = FALSE)
#usually fixed-width conversion is harder, but everything here is numeric
columns = c("CRIM", "ZN", "INDUS", "CHAS",
"NOX", "RM", "AGE", "DIS", "RAD",
"TAX", "PTRATIO", "B", "LSTAT", "MEDV")
DT = with(x, fread(paste(gsub("\\s+", "\t", V1), collapse = "\n"),
header = FALSE, sep = "\t",
col.names = columns))
这里是:
DT[ , fwrite(as.data.table(paste0(
MEDV, " | ", sapply(transpose(lapply(
names(.SD), function(jj)
paste0(jj, ":", get(jj)))),
paste, collapse = " "))),
"/path/to/output", col.names = FALSE, quote = FALSE),
.SDcols = !"MEDV"]
#what gets sent to as.data.table:
#[1] "24 | CRIM:0.00632 ZN:18 INDUS:2.31 CHAS:0 NOX:0.538 RM:6.575
# AGE:65.2 DIS:4.09 RAD:1 TAX:296 PTRATIO:15.3 B:396.9 LSTAT:4.98 MEDV:24"
#[2] "21.6 | CRIM:0.02731 ZN:0 INDUS:7.07 CHAS:0 NOX:0.469 RM:6.421
# AGE:78.9 DIS:4.9671 RAD:2 TAX:242 PTRATIO:17.8 B:396.9 LSTAT:9.14 MEDV:21.6"
# ...
可能有一种更好的方法可以让fwrite
than理解这一点as.data.table
,但我想不出一个(直到setDT
对向量起作用)。
我复制了这个以测试它在更大数据集上的性能(只是炸毁当前数据集):
DT2 = rbindlist(replicate(1000, DT, simplify = FALSE))
与此处报告的某些时间相比,该操作非常快(我还没有直接比较):
system.time(.)
# user system elapsed
# 8.392 0.000 8.385
我也测试了使用writeLines
而不是fwrite
,但后者更好。
我再看一遍,发现可能需要一段时间才能弄清楚发生了什么。也许magrittr
-piped 版本会更容易理解:
DT[ ,
#1) prepend each column's values with the column name
lapply(names(.SD), function(jj)
paste0(jj, ":", get(jj))) %>%
#2) transpose this list (using data.table's fast tool)
# (was column-wise, now row-wise)
#3) concatenate columns, separated by " "
transpose %>% sapply(paste, collapse = " ") %>%
#4) prepend each row with the target value
# (with Vowpal Wabbit in mind, separate with a pipe)
paste0(MEDV, " | ", .) %>%
#5) convert this to a data.table to use fwrite
as.data.table %>%
#6) fwrite it; exclude nonsense column name,
# and force quotes off
fwrite("/path/to/data",
col.names = FALSE, quote = FALSE),
.SDcols = !"MEDV"]
读取此类文件要容易得多**
#quickly read data; don't split within lines
x = fread("/path/to/data", sep = "\n", header = FALSE)
#tstrsplit is transpose(strsplit(.))
dt1 = x[ , tstrsplit(V1, split = "[| :]+")]
#even columns have variable names
nms = c("target_name",
unlist(dt1[1L, seq(2L, ncol(dt1), by = 2L),
with = FALSE]))
#odd columns have values
DT = dt1[ , seq(1L, ncol(dt1), by = 2L), with = FALSE]
#add meaningful names
setnames(DT, nms)
**这不适用于“参差不齐”/稀疏的输入数据。我认为没有办法将其扩展到在这种情况下工作。
将 data.frame 写入 svm light 格式的函数。我添加了一个 train={TRUE, FALSE} 参数,以防数据没有标签。在这种情况下,类索引被忽略。
write.libsvm = function(data, filename= "out.dat", class = 1, train=TRUE) {
out = file(filename)
if(train){
writeLines(apply(data, 1, function(X) {
paste(X[class],
apply(cbind(which(X!=0)[-class],
X[which(X!=0)[-class]]),
1, paste, collapse=":"),
collapse=" ")
}), out)
} else {
# leaves 1 as default for the new data without predictions.
writeLines(apply(data, 1, function(X) {
paste('1',
apply(cbind(which(X!=0), X[which(X!=0)]), 1, paste, collapse=":"),
collapse=" ")
}), out)
}
close(out)
}
** 编辑 **
library(data.table)
data.table.fm <- function (data = X, fileName = "../out.fm", target = "y_train",
train = TRUE) {
if (train) {
if (is.logical(data[[target]]) | sum(levels(factor(data[[target]])) ==
levels(factor(c(0, 1)))) == 2) {
data[[target]][data[[target]] == TRUE] = 1
data[[target]][data[[target]] == FALSE] = -1
}
}
specChar = "\\(|\\)|\\||\\:"
specCharSpace = "\\(|\\)|\\||\\:| "
parsingNames <- function(x) {
ret = c()
for (el in x) ret = append(ret, gsub(specCharSpace, "_",
el))
ret
}
parsingVar <- function(x, keepSpace, hard_parse) {
if (!keepSpace)
spch = specCharSpace
else spch = specChar
if (hard_parse)
gsub("(^_( *|_*)+)|(^_$)|(( *|_*)+_$)|( +_+ +)",
" ", gsub(specChar, "_", gsub("(^ +)|( +$)",
"", x)))
else gsub(spch, "_", x)
}
setnames(data, names(data), parsingNames(names(data)))
target = parsingNames(target)
format_vw <- function(column, formater) {
ifelse(as.logical(column), sprintf(formater, j, column),
"")
}
all_vars = names(data)[!names(data) %in% target]
cat("Reordering data.table if class isn't first\n")
target_inx = which(names(data) %in% target)
rest_inx = which(!names(data) %in% target)
cat("Adding Variable names to data.table\n")
for (j in rest_inx) {
column = data[[j]]
formater = "%s:%f"
set(data, i = NULL, j = j, value = format_vw(column,
formater))
cat(sprintf("Fixing %s\n", j))
}
data = data[, c(target_inx, rest_inx), with = FALSE]
drop_extra_space <- function(x) {
gsub(" {1,}", " ", x)
}
cat("Pasting data - Removing extra spaces\n")
data = apply(data, 1, function(x) drop_extra_space(paste(x,
collapse = " ")))
cat("Writing to disk\n")
write.table(data, file = fileName, sep = " ", row.names = FALSE,
col.names = FALSE, quote = FALSE)
}
我采用了两跳解决方案——先将 R 数据转换为另一种格式,然后再转换为 LIBSVM:
我的数据集是 200K x 500,这只需要 3-5 分钟。
这个问题是很久以前提出的,有几个答案。大多数答案对我不起作用,因为我的数据格式很长,而且我不能在 R 中一次性对其进行编码。所以这是我的看法。我编写了一个函数来对数据进行一次热编码,并保存它,而无需先将矩阵转换为稀疏矩阵。
RCPP代码:
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
#include <Rcpp.h>
#include <iostream>
#include <fstream>
#include <string>
using namespace Rcpp;
// Reading data frame from R and saving it as an libFM file
// [[Rcpp::export]]
std::string createNumber(int x, double y) {
std::string s1 = std::to_string(x);
std::string s2 = std::to_string(y);
std::string X_elem = s1 + ":" + s2;
return X_elem;
}
// [[Rcpp::export]]
std::string createRowLibFM(arma::rowvec row_to_fm, arma::vec factor_levels, arma::vec position) {
int n = factor_levels.n_elem;
std::string total = std::to_string(row_to_fm[0]);
for (int i = 1; i < n; i++) {
if (factor_levels[i] > 1) {
total = total + " " + createNumber(position[i - 1] + row_to_fm[i], 1);
}
if (factor_levels[i] == 1) {
total = total + " " + createNumber(position[i], row_to_fm[i]);
}
}
return total;
}
// [[Rcpp::export]]
void writeFile(std::string file, arma::mat all_data, arma::vec factor_levels) {
int n = all_data.n_rows;
arma::vec position = arma::cumsum(factor_levels);
std::ofstream temp_file;
temp_file.open (file.c_str());
for (int i = 0; i < n; i++) {
std::string temp_row = createRowLibFM(all_data.row(i), factor_levels, position);
temp_file << temp_row + "\n";
}
temp_file.close();
}
R函数充当它的包装器:
writeFileFM <- function(temp.data, path = 'test.txt') {
### Dealing with y function
if (!(any(colnames(temp.data) %in% 'y'))) {
stop('No y column is given')
} else {
temp.data <- temp.data %>% select(y, everything()) ## y is required to be first column for writeFile
}
### Dealing with factors/strings
temp.classes <- sapply(temp.data, class)
class.num <- rep(0, length(temp.classes))
map.list <- list()
for (i in 2:length(temp.classes)) { ### since y is always the first column
if (any(temp.classes[i] %in% c('factor', 'character'))) {
temp.col <- as.factor(temp.data[ ,i]) ### incase it is character
temp.unique <- levels(temp.col)
factors.new <- seq(0, length(temp.unique) - 1, 1)
levels(temp.col) <- factors.new
temp.data[ ,i] <- temp.col
### Saving changes
class.num[i] <- length(temp.unique)
map.list[[i - 1]] <- data.frame('original.value' = temp.unique,
'transform.value' = factors.new)
} else {
class.num[i] <- 1 ### Numeric values require only 1 column
}
}
### Writing file
print('Writing file to disc')
writeFile(all_data = sapply(temp.data, as.numeric), file = path, factor_levels = class.num)
return(map.list)
}
将其与虚假数据进行比较。
### Creating data to save
set.seed(999)
n <- 10000
factor.lvl1 <- 3
factor.lvl2 <- 2
temp.data <- data.frame('x1' = sample(stri_rand_strings(factor.lvl1, 7), n, replace = TRUE),
'x2' = sample(stri_rand_strings(factor.lvl2, 4), n, replace = TRUE),
'x3' = rnorm(n),
'x4' = rnorm(n),
'y' = rnorm(n))
### Comparing to other method
library(data.table)
library(e1071)
microbenchmark::microbenchmark(
temp.data.table <- model.matrix( ~ 0 + x1 + x2 + x3 + x4, data = temp.data,
contrasts = list(x2 = contrasts(temp.data$x2, contrasts = FALSE))),
write.matrix.csr(temp.data.table, 'out.txt'),
writeFileFM(temp.data))
结果。
min lq mean median uq
1.3061 1.6725 1.890942 1.92475 2.07725
629.9863 653.4345 676.108548 672.52510 687.88330
270.8217 275.1353 283.537898 281.42100 289.39160
max neval cld
3.2328 100 a
793.7040 100 c
328.0863 100 b
它比 e1071 选项更快,虽然该选项在增加观察次数时失败,但建议的方法仍然适用。