r - R：在 R 中的文档术语矩阵中查找与文档中的术语“欺诈”相关的前 10 个术语

Question

我有一个由年份命名的 39 个文本文件的语料库 - 1945.txt、1978.txt.... 2013.txt。

我已将它们导入 R 并使用 TM 包创建了一个文档术语矩阵。我正在尝试调查从 1945 年到 2013 年，与 term'fraud' 相关的单词多年来的变化情况。所需的输出将是一个 39 x 10/5 的矩阵，其中年份作为行标题，前 10 或 5 个术语作为列。

任何帮助将不胜感激。

提前致谢。

我的 TDM 的结构：

> str(ytdm)
List of 6
 $ i       : int [1:6791] 5 7 8 17 32 41 42 55 58 71 ...
 $ j       : int [1:6791] 1 1 1 1 1 1 1 1 1 1 ...
 $ v       : num [1:6791] 2 4 2 2 2 8 4 3 2 2 ...
 $ nrow    : int 193
 $ ncol    : int 39
 $ dimnames:List of 2
  ..$ Terms: chr [1:193] "abus" "access" "account" "accur" ...
  ..$ Docs : chr [1:39] "1947" "1976" "1977" "1978" ...
 - attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
 - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

My ideal output is like this:


1947   account accur gao medicine fed ......
1948   access  .............
.
.
.
.
.
.

score 3 · Accepted Answer

您的示例无法复制，但 findAssocs() 可能是您正在寻找的。由于您只想每年查看员工，因此您每年都需要一个 dtm。

> library(tm)
> data(crude)
> # i don't have your data so pretend this is corpus of docs for each year
> names(crude) <- rep(c("1999","2000"),10)
> # create a dtm for each year
> dtm.list <- lapply(unique(names(crude)),function(x) TermDocumentMatrix(crude[names(crude)==x]))
> # get associations for each year
> assoc.list <- lapply(dtm.list,findAssocs,term="oil",corlimit=0.7)
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
 prices barrel. 
   0.79    0.70 

$`2000`
     15.8      opec       and      said   prices,      sell       the  analysts   clearly     fixed 
     0.94      0.94      0.92      0.92      0.91      0.91      0.88      0.85      0.85      0.85 
     late   meeting     never      that    trying       who    winter emergency     above       but 
     0.85      0.85      0.85      0.85      0.85      0.85      0.85      0.84      0.83      0.83 
    world      they       mln    market agreement    before       bpd    buyers    energy    prices 
     0.82      0.80      0.79      0.78      0.75      0.75      0.75      0.75      0.75      0.75 
      set   through     under      will       not       its 
     0.75      0.75      0.75      0.74      0.72      0.70 

> # or if you want the 5 top terms
> assoc.list <- lapply(dtm.list,function(x) names(findAssocs(x,"oil",0)[1:5]))
> names(assoc.list) <- unique(names(crude))
> assoc.list
$`1999`
[1] "prices"   "barrel."  "said."    "minister" "arabian" 

$`2000`
[1] "15.8"    "opec"    "and"     "said"    "prices,"

r - R：在 R 中的文档术语矩阵中查找与文档中的术语“欺诈”相关的前 10 个术语

1 回答 1

Related

Reference