I found this code in a tutorial on multilabel classification with the mlr package:

library("mlr")

yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
yeast.task = makeMultilabelTask(id = "multi", data = yeast, target = labels)

lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)

mod = train(lrn.br, yeast.task, subset = 1:1500, weights = rep(1/1500, 1500))

pred = predict(mod, task = yeast.task, subset = 1:10)
pred = predict(mod, newdata = yeast[1501:1600,])

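I understand the structure of the yeast dataset, but I don't understand how to use this code when I have new data to classify, because then I won't have TRUE or FALSE values for any of the labels. In practice I would have training data with the same structure as yeast, but in my new data columns 1:14 would be missing. Am I misunderstanding something? If not, how do I use the code correctly?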
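
To make the scenario concrete, here is a minimal sketch of the call I would expect to need. It assumes that predict() accepts a newdata data frame that contains the feature columns but none of the 14 label columns. The name new.data is made up for this illustration, and the unlabeled data is only simulated by dropping the label columns from yeast.

```r
library("mlr")

# Same setup as in the tutorial code above.
yeast = getTaskData(yeast.task)
labels = colnames(yeast)[1:14]
task = makeMultilabelTask(id = "multi", data = yeast, target = labels)

lrn = makeLearner("classif.rpart", predict.type = "prob")
lrn = makeMultilabelBinaryRelevanceWrapper(lrn)
mod = train(lrn, task, subset = 1:1500)

# Hypothetical unlabeled data: rows 1501-1600 with the label columns removed.
# Assumption: newdata does not need to contain the target columns.
new.data = yeast[1501:1600, !(colnames(yeast) %in% labels)]

pred = predict(mod, newdata = new.data)
head(as.data.frame(pred))
```

If that assumption holds, the prediction should simply carry no truth columns, only the predicted probabilities and responses for each label.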

Edit:

Here is example code showing how I would use it:

library("mlr")
library("tm")

train.data = data.frame("id" = c(1,1,2,3,4,4), "text" = c("Monday is nice weather.", "Monday is nice weather.", "Dogs are cute.", "It is very rainy.", "My teacher is angry.", "My teacher is angry."), "label" = c("label1", "label2", "label3", "label1", "label4", "label5"))
test.data = data.frame("id" = c(5,6), "text" = c("Next Monday I will meet my teacher.", "Dogs do not like rain."))

train.data$text = as.character(train.data$text)
train.data$id = as.character(train.data$id)
train.data$label = as.character(train.data$label)
test.data$text = as.character(test.data$text)
test.data$id = as.character(test.data$id)

### Bring training data into structure
train.data$label = make.names(train.data$label)
labels = unique(train.data$label)

# DocumentTermMatrix for all texts
texts = unique(c(train.data$text, test.data$text))
docs <- Corpus(VectorSource(texts))
terms <- DocumentTermMatrix(docs)
m <- as.data.frame(as.matrix(terms))

# Logical columns for labels
test = data.frame("id" = train.data$id, "topic"=train.data$label)
test2 = as.data.frame(unclass(table(test)))
test2[] = lapply(test2, as.logical)
rownames(test2) = unique(test$id)

# Bind columns from dtm
termsDf = cbind(test2, m[1:nrow(test2),])
names(termsDf) = make.names(names(termsDf))

### Create Multilabel Task
classify.task = makeMultilabelTask(id = "multi", data = termsDf, target = labels)

### Now the model
lrn.br = makeLearner("classif.rpart", predict.type = "prob")
lrn.br = makeMultilabelBinaryRelevanceWrapper(lrn.br)
mod = train(lrn.br, classify.task)

### How can I predict for test.data?

So the problem is that I don't have any labels for test.data, because those are exactly what I want to compute.

Edit 2:

When I simply use

names(m) = make.names(names(m))
pred = predict(mod, newdata = m[(nrow(termsDf)+1):(nrow(termsDf)+nrow(test.data)),])

the result is the same for both texts, which is not what I expected.
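
One thing that might explain the identical predictions, offered as a guess and not a confirmed diagnosis: rpart only attempts a split when a node holds at least minsplit observations, and the default is 20, while termsDf has only four rows. The sketch below uses rpart directly on a made-up four-row data frame (toy, fit, n.nodes and p are hypothetical names, not part of my code above) to check whether such a tree stays a single root node and returns the same probabilities for any input.

```r
library("rpart")

# Hypothetical toy data: four rows, like the four ids in termsDf.
toy = data.frame(
  y = factor(c(TRUE, FALSE, TRUE, FALSE)),
  x = c(1, 2, 3, 4)
)

# Default control settings (minsplit = 20), so no split is attempted here.
fit = rpart(y ~ x, data = toy)

# Number of nodes in the fitted tree; a value of 1 means root only.
n.nodes = nrow(fit$frame)
print(n.nodes)

# Class probabilities for two very different inputs.
p = predict(fit, newdata = data.frame(x = c(0, 10)), type = "prob")
print(p)
```

If this is the cause, makeLearner() takes learner hyperparameters through par.vals, for example par.vals = list(minsplit = 2), though I have not checked whether that gives sensible trees on so little data.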
