My question: how can I use a bag-of-words model as features for an SVM in R?
Some data I generated is below:
Title Salary
"Software Engineer" 100000
"Software Engineer" 120000
"Junior Software Engineer" 60000
"Junior Software Engineer" 70000
"Senior Software Engineer" 130000
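Using read.table, I get a data frame with n rows and 2 columns (character, numeric). I want to apply "bag of words" to the Title column. However, if I just manually split one of the entries, for example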
jobs['Title'][1,] <- strsplit(jobs['Title'][1,], ' ')
this gives:
Title Salary
"Software" 100000
"Software Engineer" 120000
"Junior Software Engineer" 60000
"Junior Software Engineer" 70000
"Senior Software Engineer" 130000
instead of what I expected:
Title Salary
["Software", "Engineer"] 100000
"Software Engineer" 120000
"Junior Software Engineer" 60000
"Junior Software Engineer" 70000
"Senior Software Engineer" 130000
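To make the expected result concrete, here is a minimal sketch in base R. It assumes the toy data above is typed in by hand rather than read from a file, and the column name `Tokens` is hypothetical. `strsplit()` returns a list, while a character column holds a single string per cell, which is probably why only the first word survived the assignment. A list column is one way to keep a whole word vector in each row.

```r
# Toy data copied from the question.
jobs <- data.frame(
  Title  = c("Software Engineer", "Software Engineer",
             "Junior Software Engineer", "Junior Software Engineer",
             "Senior Software Engineer"),
  Salary = c(100000, 120000, 60000, 70000, 130000),
  stringsAsFactors = FALSE
)

# strsplit() returns a list: one character vector of words per title.
tokens <- strsplit(jobs$Title, " ", fixed = TRUE)
str(tokens[1:3])

# A list column can hold a whole word vector in each row.
# "Tokens" is a hypothetical column name used only for this sketch.
jobs$Tokens <- tokens
print(jobs$Tokens[[1]])
```

A list column like this is handy for inspection, but a model formula still needs numeric columns, so the words have to be turned into counts before fitting.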
My code for calling the SVM looks like this:
library(e1071)  # provides svm()
jobs <- read.table("jobs.data", header = TRUE, as.is = TRUE)
index <- 1:nrow(jobs)
testindex <- sample(index, trunc(length(index)/3))
testset <- jobs[testindex,]
trainset <- jobs[-testindex,]
svm.model <- svm(Salary ~ ., data = trainset, cost = 10, gamma = 1)
svm.pred <- predict(svm.model, testset)
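One direction that might work, sketched under assumptions: build a word-count matrix (one column per distinct word) from the titles and pass that to `svm()` in place of the raw Title strings. The sketch assumes the e1071 package is installed, reuses the toy data from above instead of reading `jobs.data`, and uses the hypothetical names `vocab`, `bow` and `features`. The features are built before the split so that the training and test sets share the same columns.

```r
library(e1071)  # provides svm(); assumed to be installed

# Toy data copied from the question.
jobs <- data.frame(
  Title  = c("Software Engineer", "Software Engineer",
             "Junior Software Engineer", "Junior Software Engineer",
             "Senior Software Engineer"),
  Salary = c(100000, 120000, 60000, 70000, 130000),
  stringsAsFactors = FALSE
)

# Split every title into words and collect the distinct words.
tokens <- strsplit(jobs$Title, " ", fixed = TRUE)
vocab  <- sort(unique(unlist(tokens)))

# One row per title, one column per word, each cell a word count.
bow <- t(vapply(tokens, function(words) {
  as.numeric(table(factor(words, levels = vocab)))
}, numeric(length(vocab))))
colnames(bow) <- vocab

# Numeric features plus the target, ready for a model formula.
features <- data.frame(bow, Salary = jobs$Salary)

# Same train/test split as in the question, applied to the features.
set.seed(1)  # only to make the sketch reproducible
index     <- 1:nrow(features)
testindex <- sample(index, trunc(length(index) / 3))
testset   <- features[testindex, ]
trainset  <- features[-testindex, ]

# With so few rows, some word columns can be constant in the training
# split, in which case svm() may warn that it cannot scale the data.
svm.model <- svm(Salary ~ ., data = trainset, cost = 10, gamma = 1)
svm.pred  <- predict(svm.model, testset)
print(svm.pred)
```

With only five rows the predictions are not meaningful; the point is just the shape of the input. For a real data set, a text-mining package could build the same kind of matrix, and `cost` and `gamma` would need tuning.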
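I think I'm going about this the wrong way, but I haven't found the right approach. Could someone show me how this should be done?
Thank you.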