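I have started working on a project that requires natural language processing and building a support vector machine (SVM) model in R.
I want to generate a term-document matrix that contains all tokens.
Example: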
library(NLP)
library(openNLP)
library(tm)
testset <- c("From month 2 the AST and total bilirubine were not measured.",
             "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
             "M6 is 13 days out of the visit window")
# Sentence and word token annotators from openNLP
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
test_annotations <- annotate(testset, list(sent_ann, word_ann))
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)
[[1]]
[1] "From" "month" "2" "the" "AST" "and" "total"
[8] "bilirubine" "were" "not" "measured" "."
[[2]]
[1] "16:OTHER" "-"
[3] "COMMENT" "REQUIRED"
[5] "IN" "COMMENT"
[7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"
[9] "consent" "not"
[11] "offered" "until"
[13] "T4" "."
[[3]]
[1] "M6" "is" "13" "days" "out" "of" "the" "visit" "window"
Then I generated a TDM:
tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc)))
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 1)>>
Non-/sparse entries: 22/0
Sparsity : 0%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms NULL
16:other 1
and 1
ast 1
bilirubine 1
column;07/02/2004/genotyping;sf- 1
comment 2
consent 1
days 1
from 1
genotyping 1
measured 1
month 1
not 2
offered 1
out 1
required 1
the 2
total 1
until 1
visit 1
were 1
window 1
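I actually have three documents in the dataset: "From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", and "M6 is 13 days out of the visit window", so the matrix should have three document columns. However, it shows only one column.
Can anyone give me some advice?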
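To make the shape I expect concrete, here is a minimal base-R sketch with no tm or openNLP involved. It is an illustration only: splitting on whitespace is an assumption that stands in for the openNLP tokens (so punctuation stays attached to words), and the names `tokens`, `terms`, and `tdm_expected` are made up for this example.

```r
# Illustrative only: whitespace tokens stand in for the openNLP tokens.
testset <- c("From month 2 the AST and total bilirubine were not measured.",
             "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
             "M6 is 13 days out of the visit window")

# One token vector per input string (lower-cased, split on whitespace)
tokens <- strsplit(tolower(testset), "\\s+")

# Vocabulary across all documents
terms <- sort(unique(unlist(tokens)))

# Count each term per document: rows are terms, columns are documents
tdm_expected <- vapply(
  tokens,
  function(tok) tabulate(match(tok, terms), nbins = length(terms)),
  integer(length(terms))
)
dimnames(tdm_expected) <- list(Terms = terms,
                               Docs = as.character(seq_along(testset)))

ncol(tdm_expected)       # one column per input string
tdm_expected["not", ]    # per-document counts for a single term
```

What I am after is the same layout from `TermDocumentMatrix`, with one column per element of `testset` rather than a single combined column.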
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.6-2 openxlsx_3.0.0 magrittr_1.5 RWeka_0.4-28 openNLP_0.2-6 NLP_0.1-9
[7] rJava_0.9-8