r - Split sample of an R tm corpus object

Question

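I'm using the R tm package and am trying to split my corpus into a training set and a test set, encoding the split into the metadata so that I can select on it. What is the easiest way to do this (suppose I want to split my sample in half)?

Here are some things I've tried:

I know that when I type...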

> meta(d)
    MetaID Y
1        0 1
2        0 1

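I can see the IDs, but cannot seem to access them (in order to say that the first half belongs to one set and the second half to another). rownames(attributes(d)$DMetaData) gives me the indices, but this looks ugly, and they are factors.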

Right now, after converting to a data frame (say d is my dataset), I just write:

half <- floor(dim(d)[1]/2)
d$train <- d[1:half,]
d$test <- d[(half+1):(half*2),]

But how can I easily do something like...

meta(d, tag="split") = ifelse((meta(d,"ID")<=floor(length(d)/2)),"train","test")

...to get a result like this:

> meta(d)
    MetaID Y split
1        0 1 train
2        0 1 train
...      . . ...
100      0 1 test

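Unfortunately, meta(d,"ID") doesn't work. meta(d[[1]],"ID") == 1 does, but doing this document by document is redundant. I'm looking for a whole-vector way of accessing the meta ID, or a generally smarter way of subsetting and assigning to the "split" meta variable.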

score 4 · Accepted Answer

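A corpus is just a list, so you can split it like any ordinary list. Here is an example:

First I create some data, using the texts shipped with the tm package: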

txt <- system.file("texts", "txt", package = "tm")
(ovid <- Corpus(DirSource(txt)))
A corpus with 5 text documents

Now I split the data into training and test sets:

nn <- length(ovid)
ff <- as.factor(c(rep('Train',ceiling(nn/2)),   ## you create the split factor as you want
                rep('Test',nn-ceiling(nn/2))))  ## you can add validation set for example...
ll <- split(as.matrix(ovid),ff)
ll
$Test
A corpus with 2 text documents

$Train
A corpus with 3 text documents

Then I assign the new tag:

ll <- sapply( names(ll),
              function(x) {
                meta(ll[[x]],tag = 'split') <- ff[ff==x]
                ll[x]
              })

You can check the result:

lapply(ll,meta)
$Test.Test
  MetaID split
4      0  Test
5      0  Test

$Train.Train
  MetaID split
1      0 Train
2      0 Train
3      0 Train
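
For comparison, here is a minimal sketch of the whole-vector assignment the question asks about. It assumes a more recent tm release in which meta() on a corpus accepts a full vector for a tag; that may not hold for the tm version used above. half_split_labels is a hypothetical helper, not part of tm, and it uses floor() as in the question, so with an odd number of documents the extra one goes to "test". The tm part is guarded so that the label construction can be run on its own.

```r
# Hypothetical helper (not part of tm): label the first floor(n / 2)
# positions "train" and the remaining ones "test".
half_split_labels <- function(n) {
  ifelse(seq_len(n) <= floor(n / 2), "train", "test")
}

print(half_split_labels(5))

# Assumption: a tm version where meta(corpus, tag) <- vector sets one
# value per document in the corpus-level metadata data frame.
if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  data(crude)  # example corpus shipped with tm

  meta(crude, "split") <- half_split_labels(length(crude))

  # Subset the corpus with the stored labels.
  train <- crude[meta(crude)$split == "train"]
  test  <- crude[meta(crude)$split == "test"]
  print(c(train = length(train), test = length(test)))
}
```

Storing the labels once in the metadata means later selections only need a logical comparison on that column, rather than a second list of split corpora.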

score 2 · Accepted Answer

## use test corpus crude in tm
library(tm)
data(crude)

#random training sample
half<-floor(length(crude)/2)
train<-sample(1:length(crude), half)

# meta doesn't handle lists or vectors very well, so loop:
for (i in 1:length(crude)) meta(crude[[i]], tag="Tset") <- "test"
for (i in 1:half) meta(crude[[train[i]]], tag="Tset") <- "train"

# check result
for (i in 1:10) print(meta(crude[[i]], tag="Tset"))

This seems to work.
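
A possible variation on the loops above, as a sketch only: build the whole label vector first, with a fixed seed so the random split can be reproduced, and then write it in a single pass. random_split_labels and its seed argument are hypothetical names introduced here for illustration. The tm part is guarded and uses the same per-document meta() assignment as above.

```r
# Hypothetical helper (not part of tm): randomly mark floor(n / 2)
# positions as "train" and the rest as "test".
random_split_labels <- function(n, seed = 42) {
  set.seed(seed)                              # fixed seed for a reproducible split
  train_idx <- sample.int(n, floor(n / 2))    # positions drawn without replacement
  ifelse(seq_len(n) %in% train_idx, "train", "test")
}

labels <- random_split_labels(20)
print(table(labels))

if (requireNamespace("tm", quietly = TRUE)) {
  library(tm)
  data(crude)
  labels <- random_split_labels(length(crude))
  # One pass over the corpus instead of two loops.
  for (i in seq_along(crude)) meta(crude[[i]], tag = "Tset") <- labels[i]
}
```

Keeping the labels in an ordinary character vector also makes it easy to check the class balance with table() before touching the corpus.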
