r - R: Natural language processing on Support Vector Machine-TermDocumentMatrix

Question

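I have started working on a project that requires natural language processing and building a support vector machine (SVM) model in R.

I would like to generate a term-document matrix containing all the tokens.

Example: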

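library(NLP)      # annotate(), AnnotatedPlainTextDocument(), sents()
library(openNLP)  # Maxent_*_Token_Annotator()
library(tm)       # TermDocumentMatrix(), as.VCorpus()
testset <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",  "M6 is 13 days out of the visit window")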
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
test_annotations <- annotate(testset, list(sent_ann, word_ann))
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)

[[1]]
 [1] "From"       "month"      "2"          "the"        "AST"        "and"        "total"     
 [8] "bilirubine" "were"       "not"        "measured"   "."         

[[2]]
 [1] "16:OTHER"                         "-"                               
 [3] "COMMENT"                          "REQUIRED"                        
 [5] "IN"                               "COMMENT"                         
 [7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"                      
 [9] "consent"                          "not"                             
[11] "offered"                          "until"                           
[13] "T4"                               "."                               

[[3]]
[1] "M6"     "is"     "13"     "days"   "out"    "of"     "the"    "visit"  "window"

Then I generated a TDM:

tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc)))
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 1)>>
Non-/sparse entries: 22/0
Sparsity           : 0%
Maximal term length: 32
Weighting          : term frequency (tf)

                                  Docs
Terms                              NULL
  16:other                            1
  and                                 1
  ast                                 1
  bilirubine                          1
  column;07/02/2004/genotyping;sf-    1
  comment                             2
  consent                             1
  days                                1
  from                                1
  genotyping                          1
  measured                            1
  month                               1
  not                                 2
  offered                             1
  out                                 1
  required                            1
  the                                 2
  total                               1
  until                               1
  visit                               1
  were                                1
  window                              1

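I actually have three documents in the dataset: "From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4." and "M6 is 13 days out of the visit window", so the matrix should show 3 document columns. But only one column is shown here.

Could someone give me some advice?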

sessionInfo()
    R version 3.3.0 (2016-05-03)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows >= 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tm_0.6-2       openxlsx_3.0.0 magrittr_1.5   RWeka_0.4-28   openNLP_0.2-6  NLP_0.1-9     
[7] rJava_0.9-8

score 0 · Accepted Answer

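I think what you are doing is taking a list of 3 strings and then trying to make that into a corpus. I am not sure whether 3 different strings in a list count as 3 different documents.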
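To check that suspicion, here is a minimal sketch. It assumes a plain `PlainTextDocument` behaves like the single `AnnotatedPlainTextDocument` from the question as far as document counting goes (so openNLP and Java are not needed), and it relies on `as.VCorpus()` accepting a list, as the question's own code does. The names `single_doc`, `corpus_one` and `n_docs` are only for illustration.

```r
library(tm)

testset <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)

# One document object holding all three strings (a stand-in for the single
# AnnotatedPlainTextDocument built in the question).
single_doc <- PlainTextDocument(testset)

# list(single_doc) is a list of length one, so the corpus built from it
# contains one document, no matter how many strings that document holds.
corpus_one <- as.VCorpus(list(single_doc))
n_docs <- length(corpus_one)
print(n_docs)
```

If the count printed here is 1, that would explain the single column in your TDM: the three strings were merged into one document before the corpus was built.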

I put your data into 3 txt files and ran this.

text_name <- file.path("C:", "texts")
dir(text_name)

[1] "text1.txt" "text2.txt" "text3.txt"

If you don't want to do any cleaning, you can convert it directly into a corpus:

docs <- Corpus(DirSource(text_name)) 
summary(docs)
          Length Class             Mode
text1.txt 2      PlainTextDocument list
text2.txt 2      PlainTextDocument list
text3.txt 2      PlainTextDocument list

dtm <- DocumentTermMatrix(docs)   
dtm

<<DocumentTermMatrix (documents: 3, terms: 22)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

tdm <- TermDocumentMatrix(docs) 
tdm
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

inspect(tdm)


<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity           : 64%
Maximal term length: 32
Weighting          : term frequency (tf)

                              Docs
Terms                              text1.txt text2.txt text3.txt
16:other                                 0         1         0
and                                      1         0         0
ast                                      1         0         0
bilirubine                               1         0         0
column;07/02/2004/genotyping;sf-         0         1         0
comment                                  0         2         0
consent                                  0         1         0
days                                     0         0         1
from                                     1         0         0
genotyping                               0         1         0
measured.                                1         0         0
month                                    1         0         0
not                                      1         1         0
offered                                  0         1         0
out                                      0         0         1
required                                 0         1         0
the                                      1         0         1
total                                    1         0         0
until                                    0         1         0
visit                                    0         0         1
were                                     1         0         0
window                                   0         0         1
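Note that the term `measured.` above still carries its trailing period, because no cleaning was applied. If you do want cleaning, a rough sketch with `tm_map()` could look like the following. It builds the corpus from a character vector with `VectorSource()` instead of the text files, purely so that it runs without the files on disk; `docs_clean`, `tdm_clean` and `terms_clean` are illustrative names.

```r
library(tm)

testset <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)

# Assumption: a vector-based corpus stands in for the DirSource corpus.
docs <- VCorpus(VectorSource(testset))

# Lower-case the text, then strip punctuation characters.
docs_clean <- tm_map(docs, content_transformer(tolower))
docs_clean <- tm_map(docs_clean, removePunctuation)

tdm_clean <- TermDocumentMatrix(docs_clean)
terms_clean <- Terms(tdm_clean)
print(terms_clean)
```

Be aware that `removePunctuation` deletes the characters rather than replacing them with spaces, so a token such as `COLUMN;07/02/2004/GENOTYPING;SF-` stays glued together as one long term. Whether that is acceptable depends on your data.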

I think you may want to create 3 different lists and then convert them into a corpus. Let me know if this helps.
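One way to read that suggestion is to build one document object per string and combine them. The sketch below does this with plain `PlainTextDocument` objects; I am assuming the same pattern carries over to one `AnnotatedPlainTextDocument` per string if you need the openNLP annotations. The ids `doc1` to `doc3` are made up for the example.

```r
library(tm)

testset <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)

# Build a separate document for each string, each with its own id.
doc_list <- lapply(seq_along(testset), function(i) {
  PlainTextDocument(testset[i], id = paste0("doc", i))
})

# A list of three documents becomes a corpus of three documents.
corpus <- as.VCorpus(doc_list)
tdm <- TermDocumentMatrix(corpus)
n_docs <- nDocs(tdm)
print(n_docs)
```

Giving each document an id also avoids the `NULL` column header seen in the question's output.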

score 0 · Accepted Answer

So, given that you want each row of the text column to be treated as a document, convert the list to a data frame:

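install.packages("tm")  # only needed if tm is not installed yet
library(tm)
# keep the text as character rather than factor
df <- data.frame(testset, stringsAsFactors = FALSE)
docs <- Corpus(VectorSource(df$testset))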
summary(docs)
  Length Class             Mode
1 2      PlainTextDocument list
2 2      PlainTextDocument list
3 2      PlainTextDocument list

After this, follow the steps mentioned in the previous answer to get your tdm. This should solve your problem.
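Put together, those steps might look like the sketch below. The transpose at the end is my own addition, on the assumption that the SVM step will want one row per document; `doc_rows` is just an illustrative name.

```r
library(tm)

testset <- c(
  "From month 2 the AST and total bilirubine were not measured.",
  "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
  "M6 is 13 days out of the visit window"
)

df <- data.frame(testset, stringsAsFactors = FALSE)

# VectorSource treats each element of the vector as its own document.
docs <- Corpus(VectorSource(df$testset))
tdm <- TermDocumentMatrix(docs)

# Dense terms-by-documents matrix, transposed so each document is a row.
doc_rows <- t(as.matrix(tdm))
print(dim(doc_rows))
```

The exact number of terms can vary with the tm version and its default tokenizer, but there should be three rows, one per document. `as.matrix()` is fine for a small example like this; for a large corpus the dense matrix may not fit in memory.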
