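If I understand correctly, you have already made a DTM, and you want to make a new DTM from new documents that has the same columns (i.e. terms) as the first one. If that's the case, it should be a matter of subsetting the second DTM by the terms in the first, perhaps something like this:
First, set up some reproducible data...
Here is your training data...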
library(tm)
# crude is a corpus that ships with tm, so the example is reproducible
data("crude")
corpus1 <- crude[1:10]
# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
# wrap base functions such as tolower in content_transformer() so that
# recent versions of tm keep the documents intact
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude1 <- tm_map(corpus1, FUN = tm_reduce, tmFuns = funcs)
crude1.dtm <- DocumentTermMatrix(crude1, control = list(wordLengths = c(3,10)))
Here is your testing data...
corpus2 <- crude[15:20]
# process text with the same steps as the training data (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers,
              stripWhitespace, skipWords)
crude2 <- tm_map(corpus2, FUN = tm_reduce, tmFuns = funcs)
crude2.dtm <- DocumentTermMatrix(crude2, control = list(wordLengths = c(3,10)))
Here is the part that does what you want:
Now we keep only those terms in the testing data that are also present in the training data...
# convert to matrices for subsetting
crude1.dtm.mat <- as.matrix(crude1.dtm) # training
crude2.dtm.mat <- as.matrix(crude2.dtm) # testing
# subset testing data by the colnames (i.e. terms) of the training data;
# drop = FALSE keeps a matrix even if only one term is shared
xx <- data.frame(crude2.dtm.mat[, intersect(colnames(crude2.dtm.mat),
                                            colnames(crude1.dtm.mat)),
                                drop = FALSE])
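To make the subsetting step easier to see, here is a minimal sketch on two tiny made-up matrices. The terms and counts are invented for illustration and are not taken from the crude data.

```r
# toy document-term matrices (made-up counts, purely illustrative)
train <- matrix(c(1, 0, 2,
                  0, 3, 1),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("d1", "d2"), c("oil", "price", "opec")))
test <- matrix(c(2, 1, 0,
                 0, 4, 1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("d3", "d4"), c("oil", "barrel", "price")))

# terms that occur in both matrices, in the order they appear in the test data
shared <- intersect(colnames(test), colnames(train))

# keep only the shared terms; "barrel" is dropped because training never saw it
test_shared <- test[, shared, drop = FALSE]
print(test_shared)
```

Terms that exist only in the training data ("opec" here) are still missing at this point, which is why another step is needed.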
Finally, add to the testing data an empty column for every term that is in the training data but not in the testing data...
# make an empty (zero-row) data frame with the colnames of the training data
yy <- read.table(textConnection(""), col.names = colnames(crude1.dtm.mat),
                 colClasses = "integer")
# add columns of NAs for terms absent in the testing data
# but present in the training data,
# following SchaunW's suggestion in the comments above
library(plyr)
zz <- rbind.fill(xx, yy)
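If the NAs get in the way later on, one alternative is to fill absent terms with zero counts and keep the columns in the training order. The sketch below uses only base R on made-up data; align_to_training is a hypothetical helper name, not a function from tm or plyr.

```r
# hypothetical helper: reshape a test matrix onto the training vocabulary
align_to_training <- function(test_mat, train_terms) {
  # start from all zeros, one column per training term, in training order
  out <- matrix(0, nrow = nrow(test_mat), ncol = length(train_terms),
                dimnames = list(rownames(test_mat), train_terms))
  # copy over the counts for terms that the test data actually contains
  shared <- intersect(colnames(test_mat), train_terms)
  out[, shared] <- test_mat[, shared]
  out
}

# toy input (made-up counts, purely illustrative)
test <- matrix(c(2, 1, 0,
                 0, 4, 1),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("d3", "d4"), c("oil", "barrel", "price")))
train_terms <- c("oil", "price", "opec")

aligned <- align_to_training(test, train_terms)
print(aligned)
```

A zero is arguably a more natural value than NA for a term that simply does not occur in a document, but whether that matters depends on what you feed the result into.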
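So zz is a data frame of the test documents, but with the same structure as the training documents (i.e. the same columns, although many of them contain NAs, as SchaunW noted).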
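A different route, not the one used above, is to let tm restrict the vocabulary when the second DTM is built, by passing the training terms through the dictionary control option. This is only a sketch: it assumes a reasonably recent version of tm, and the clean function is a simplified stand-in for whatever preprocessing you actually use.

```r
library(tm)
data("crude")

# simplified, hypothetical cleaning pipeline applied to both sets
clean <- function(corp) {
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, stopwords("english"))
  tm_map(corp, stripWhitespace)
}

train <- clean(crude[1:10])
test <- clean(crude[15:20])

train.dtm <- DocumentTermMatrix(train, control = list(wordLengths = c(3, 10)))

# tabulate the test documents against the training terms only
test.dtm <- DocumentTermMatrix(test,
                               control = list(dictionary = Terms(train.dtm)))

dim(train.dtm)
dim(test.dtm)
```

With this approach, terms that never occur in the test documents should show up as zero counts rather than NAs, and no data frame conversion is needed.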
Does that do what you want?