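I have a 10025x1417 TF-IDF dfm matrix created with quanteda. (The actual class is dfmSparse, which is a subclass of dfm-matrix.) When I convert it to h2o with as.data.frame followed by as.h2o, I erroneously get 10026x1417, with an unwanted extra first row of NaNs. For performance reasons, I don't want to create a temporary df holding the full dense matrix.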

The code is below (I can't reproduce this on small toy data):

library(quanteda)
mat <- quanteda::weight(theDfm, type="tfidf")

# Convert to df then h2o, correctly gives 10025x1417 matrix
mat_df  <- as.data.frame(mat) # this will dispatch quanteda::as.data.frame for dfmSparse
mat_h2o <- as.h2o(mat_df)

# Convert in one go, get 10026x1417, get unwanted extra first row of NaNs
bad_h2o <- as.h2o(as.data.frame(mat))
dim(bad_h2o)
[1] 10026  1417

# Which as.data.frame method this uses
> showMethods(quanteda::as.data.frame)
Function: as.data.frame (package base)
x="ANY"
x="dfm"
x="dfmSparse"
    (inherited from: x="dfm")
x="matrix"
    (inherited from: x="ANY")

#########################################
# Ken Benoit requested sessionInfo()

R version 3.2.3 (2015-12-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] h2o_3.8.3.3         statmod_1.4.22      quanteda_0.9.8      RevoUtilsMath_3.2.3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.2      lattice_0.20-33  SnowballC_0.5.1  bitops_1.0-6     chron_2.3-47     grid_3.2.3       R6_2.1.1        
 [8] jsonlite_0.9.19  magrittr_1.5     httr_1.0.0       stringi_1.0-1    data.table_1.9.6 ca_0.58          Matrix_1.2-3    
[15] tools_3.2.3      stringr_1.0.0    RCurl_1.95-4.7   parallel_3.2.3 
1 Answer

"For performance reasons, I don't want to create a temporary df holding the full dense matrix."

In fact, quanteda converts your sparse matrix to a dense matrix before converting it to a data.frame: https://github.com/kbenoit/quanteda/blob/master/R/dfm-classes.R#L513-L516

If you need to get a sparse matrix into h2o, convert it to svmlight format and use importFile. See this thread: How to use H2o on a feature hashed matrix in R
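
As a rough illustration of the conversion step, here is a minimal sketch that writes a sparse matrix out in svmlight format using only the Matrix package. The helper name to_svmlight_lines is hypothetical (it is not part of quanteda or h2o), the all-zero labels are placeholders, and the tiny matrix merely stands in for the real dfm. The import itself is left as a comment, because the arguments needed to select the SVMLight parser may differ between h2o versions.

```r
library(Matrix)

# Hypothetical helper (not a quanteda or h2o function): turn a sparse matrix
# into svmlight text lines of the form "<label> <col>:<value> <col>:<value> ...",
# with 1-based column indices in ascending order within each row.
to_svmlight_lines <- function(x, labels = rep(0, nrow(x))) {
  # summary() on a sparse matrix returns the non-zero triplets (i, j, x)
  trip <- summary(as(x, "CsparseMatrix"))
  trip <- trip[trip$x != 0, , drop = FALSE]            # drop explicitly stored zeros
  trip <- trip[order(trip$i, trip$j), , drop = FALSE]  # sort by row, then column
  pairs <- paste0(trip$j, ":", trip$x)
  # Collapse the pairs of each row; rows without entries come back as NA
  feats <- tapply(pairs, factor(trip$i, levels = seq_len(nrow(x))),
                  paste, collapse = " ")
  feats[is.na(feats)] <- ""
  trimws(paste(labels, feats))
}

# Small stand-in for the real dfm: 3 documents x 4 features, row 2 is empty
m <- sparseMatrix(i = c(1, 1, 3), j = c(2, 4, 1),
                  x = c(0.5, 1.25, 2), dims = c(3, 4))

svm_lines <- to_svmlight_lines(m)
out_file <- tempfile(fileext = ".svmlight")
writeLines(svm_lines, out_file)

# Assumption: out_file would then be loaded with h2o's importFile, as the
# answer suggests; check the h2o documentation for how to request the
# SVMLight parser in your version.
```

Only the non-zero entries are ever materialised here, so the dense 10025x1417 data.frame is never built. Whether h2o accepts rows that have no features, and how it treats the label column, should be checked on a small file before converting the full matrix.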

answered 2016-08-16T12:20:45.157