r - How to simulate a bag-of-words model in R to fit an SVM

Question

My question: how do I use a bag-of-words model as features for an svm in R?

Some made-up data is below:

Title Salary
"Software Engineer" 100000
"Software Engineer" 120000
"Junior Software Engineer" 60000
"Junior Software Engineer" 70000
"Senior Software Engineer" 130000

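Using read.table, I get a data frame with n rows and 2 columns (one character, one numeric). I want to apply "bag of words" to the Title column. However, if I just split an entry by hand, for example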

jobs['Title'][1,] <- strsplit(jobs['Title'][1,], ' ')

this gives:

Title Salary
"Software" 100000
"Software Engineer" 120000
"Junior Software Engineer" 60000
"Junior Software Engineer" 70000
"Senior Software Engineer" 130000

instead of what I expected:

Title Salary
["Software", "Engineer"] 100000
"Software Engineer" 120000
"Junior Software Engineer" 60000
"Junior Software Engineer" 70000
"Senior Software Engineer" 130000
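For reference, strsplit returns a list with one character vector per input string, while a character column stores exactly one string per cell, so a split result cannot be written back into a single cell as is. A minimal sketch, assuming the same made-up data, of keeping the words in a separate list column instead (the names `tokens` and `Tokens` are illustrative and not part of the original post):

```r
# Same made-up data as in the question
jobs <- data.frame(
  Title  = c("Software Engineer", "Software Engineer",
             "Junior Software Engineer", "Junior Software Engineer",
             "Senior Software Engineer"),
  Salary = c(100000, 120000, 60000, 70000, 130000),
  stringsAsFactors = FALSE
)

# strsplit() returns a list: one character vector of words per title
tokens <- strsplit(jobs$Title, " ", fixed = TRUE)

# A list column can hold a whole vector of words in each row
# (`Tokens` is an illustrative column name)
jobs$Tokens <- tokens

jobs$Tokens[[1]]
# "Software" "Engineer"
```

A list column is convenient for inspection, but it is still not something svm can consume directly; the words have to be turned into numeric columns first.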

My code for calling the SVM looks like this:

jobs <- read.table("jobs.data", header = TRUE, as.is = TRUE)
index <- 1:nrow(jobs)
testindex <- sample(index, trunc(length(index)/3))
testset <- jobs[testindex,]
trainset <- jobs[-testindex,]
svm.model <- svm(Salary ~ ., data = trainset, cost = 10, gamma = 1)
svm.pred <- predict(svm.model, testset)

I think I am doing this wrong, but I have not found the right way yet. Could someone share how I should do it?

Thank you.

score 3 · Accepted Answer

It is worrying that a basic machine-learning question got downvoted. So let me answer my own question.

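Each word gets its own position in a vector, where 1 means the word is present and 0 means it is absent. Essentially, this forms a sparse matrix, plus one column for the class.
With Python, use a dictionary to represent the bag of words. String manipulation is much easier in Python. Feed the data into NLTK or PyOrange.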
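The 0/1 encoding described above can also be sketched in base R rather than Python. This is only an illustration under stated assumptions: it uses the titles from the question, lowercases them, splits on single spaces, and builds a dense matrix instead of a sparse one; the names `titles`, `tokens`, `vocab`, and `bow` are made up for the example.

```r
titles <- c("Software Engineer", "Software Engineer",
            "Junior Software Engineer", "Junior Software Engineer",
            "Senior Software Engineer")

# Lowercase and split on spaces: one character vector of words per title
tokens <- strsplit(tolower(titles), " ", fixed = TRUE)

# Vocabulary: every distinct word, sorted so the column order is stable
vocab <- sort(unique(unlist(tokens)))

# One row per title, one column per word; 1 = word present, 0 = absent
bow <- t(vapply(tokens,
                function(words) as.integer(vocab %in% words),
                integer(length(vocab))))
colnames(bow) <- vocab

bow
```

Each row of `bow` is the indicator vector for one title, so the third row ("junior software engineer") has a 1 in every column except `senior`. The class column (Salary here) is kept separately and passed alongside the matrix.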

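The takeaway here is that R does not seem to be a language designed for string manipulation. You can use the tm library to help you.

I hope this helps anyone facing a similar problem.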

score 1 · Accepted Answer

This is easy to do in R with the tm package:

library(Matrix)  # sparseMatrix()
library(e1071)   # svm()
library(tm)      # VCorpus(), tm_map(), DocumentTermMatrix()
options(stringsAsFactors = FALSE)

jobs <- data.frame(Title = c("Software Engineer", "Software Engineer", 
                             "Junior Software Engineer", "Junior Software Engineer", 
                             "Senior Software Engineer", "Hardware Engineer"),
                   Salary = c(100000, 120000,
                              60000, 70000,
                              130000, 110000))

# Create the corpus
MyCorpus <- VCorpus(VectorSource(jobs$Title),  readerControl = list(language = "en"))
content(MyCorpus[[1]])

# Some preprocessing
MyCorpus <- tm_map(MyCorpus, content_transformer(tolower))
content(MyCorpus[[1]])

# Create the Document-Term matrix
DTM <- DocumentTermMatrix(MyCorpus, 
                          control = list(bounds = list(global = c(0, Inf)))) 
dim(DTM)
inspect(DTM)

# Create a sparse matrix to put into SVM
sparse_DTM <- sparseMatrix(i = DTM$i, j = DTM$j, x = DTM$v,
                               dims = dim(DTM),
                               dimnames = list(rownames(DTM), colnames(DTM)))

# SVM
svm.model <- svm(sparse_DTM, jobs$Salary, cost = 10, gamma = 1)

I leave the training/test split to you, and refer you to the tm package help for further details.
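One possible way to handle the training/test split, sketched under assumptions: to keep the block self-contained, a dense 0/1 matrix built in base R stands in for `sparse_DTM`, and the split reuses the `sample` and `trunc` pattern from the question. The seed and the names `x`, `x_train`, and `x_test` are illustrative. The key point is to subset the feature matrix and the Salary vector with the same row indices.

```r
library(e1071)  # provides svm(), as in the answer above

jobs <- data.frame(
  Title  = c("Software Engineer", "Software Engineer",
             "Junior Software Engineer", "Junior Software Engineer",
             "Senior Software Engineer", "Hardware Engineer"),
  Salary = c(100000, 120000, 60000, 70000, 130000, 110000),
  stringsAsFactors = FALSE
)

# Dense 0/1 bag-of-words matrix (a stand-in for sparse_DTM)
tokens <- strsplit(tolower(jobs$Title), " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))
x <- t(vapply(tokens,
              function(words) as.numeric(vocab %in% words),
              numeric(length(vocab))))
colnames(x) <- vocab

set.seed(42)  # arbitrary seed, only so the split is reproducible
testindex <- sample(nrow(x), trunc(nrow(x) / 3))

# Use the same row indices for the features and for the target
x_train <- x[-testindex, , drop = FALSE]
x_test  <- x[testindex, , drop = FALSE]

svm.model <- svm(x_train, jobs$Salary[-testindex], cost = 10, gamma = 1)
svm.pred  <- predict(svm.model, x_test)
svm.pred
```

With only six rows the predictions are not meaningful; the sketch only shows the mechanics. svm may also warn that a constant column cannot be scaled, since every title here contains "engineer".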
