r - List of word-frequency pairs into a matrix in R

Question

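I have a large dataset in the following format, where each line holds one document, encoded as word:frequency-in-the-document pairs separated by spaces; lines can be of variable length: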

aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...

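For example, in the first document, "aword" occurs 3 times. What I ultimately want to do is build a small search engine that ranks the documents (in the same format) matching a query. I thought about using TfIdf and the tm package, based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html. Otherwise, I would just use tm's TermDocumentMatrix function on a text corpus, but the catch here is that I already have this data indexed in this format (and I'd rather use this data, unless the format is truly something alien that cannot be converted).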

What I have tried so far is to import the lines and split them:

docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")

I figured I would put something like this in a loop:

doclist2 <- strsplit(doclist, ":", fixed=TRUE)

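and somehow put the paired values into an array, then run a loop that fills a matrix (prefilled with zeros: matrix(0,x,y)) by fetching the appropriate values from the word:freq pairs (is constructing the matrix this way a good idea in itself?). But this way of converting does not seem like a good approach: the lists keep getting more complicated, and I still don't know how to get to the point where I can fill the matrix.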
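A minimal sketch of the prefilled-matrix idea described above, in base R. The sample vector `docs` stands in for the lines read from data.txt, and the names `pairs`, `vocab`, and `m` are hypothetical, chosen only for illustration. It assumes every line contains at least one word:count pair and that no word repeats within a line.

```r
# Sample lines in the word:count format (assumed stand-in for data.txt)
docs <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")

# Split every line into tokens, then every token into a (word, count) row
doclist <- strsplit(docs, "[[:space:]]+")
pairs <- lapply(doclist, function(tok) {
  do.call(rbind, strsplit(tok, ":", fixed = TRUE))
})

# Rows are the sorted vocabulary, columns are documents, all cells start at zero
vocab <- sort(unique(unlist(lapply(pairs, function(p) p[, 1]))))
m <- matrix(0, nrow = length(vocab), ncol = length(docs),
            dimnames = list(vocab, paste0("doc", seq_along(docs))))

# Fill one column per document, indexing the rows by word name
for (j in seq_along(pairs)) {
  m[pairs[[j]][, 1], j] <- as.numeric(pairs[[j]][, 2])
}
m
```

Indexing the rows by name avoids a nested loop over words, so only one pass over the documents is needed.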

What I (think I) ultimately need is a matrix like this:

        doc1 doc2 doc3 doc4 ...
aword   3    0    0    0
bword   2    4    0    0
cword   15   20   0    0
dword   2    0    0    0
fword   0    1    0    0
...

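I could then convert it to a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably can't find because I don't know what these things are called (I have been googling for a day, on themes such as "term document vector/array/pairs", "two-dimensional array", "list into matrix", etc.).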

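What is a good way to get a list of documents like this into a term-document frequency matrix? Or, in case the solution is too obvious or achievable with built-in functions: what is the actual term for the format I described above, where term:frequency pairs sit on one line and each line is a document?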

score 0 · Accepted Answer

Here is one approach that gets you the output you are probably after:

## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x) 
  cbind(document = names(B)[x], matrix(B[[x]], ncol = 2, byrow = TRUE,
                                dimnames = list(NULL, c("word", "count"))))))

## Convert to a data.frame
out <- data.frame(out)
out
#    document  word count
# 1 document1 aword     3
# 2 document1 bword     2
# 3 document1 cword    15
# 4 document1 dword     2
# 5 document2 bword     4
# 6 document2 cword    20
# 7 document2 fword     1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))

## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
#        document
# word    document1 document2
#   aword         3         0
#   bword         2         4
#   cword        15        20
#   dword         2         0
#   fword         0         1

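Note: the answer was edited to build "out" from matrices, minimizing the number of calls to read.table, which would become the major bottleneck on larger data.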
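For larger inputs, the per-document lapply/rbind step could in principle be skipped as well. The following is a sketch of a fully vectorized base R variant, not the method used in the answer above; the names `tokens`, `flat`, `long`, and `tdm` are hypothetical. It assumes each token contains a colon followed by a numeric count, and it trims leading and trailing whitespace so that no empty tokens are produced.

```r
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")

# Split each (trimmed) document into its "word:count" tokens
tokens <- strsplit(trimws(x), "[[:space:]]+")

# One document label per token; explicit levels keep the original document
# order and retain documents that end up with no tokens as all-zero columns
labels <- paste0("document", seq_along(tokens))
doc <- factor(rep(labels, lengths(tokens)), levels = labels)

# Flatten all tokens, then separate each one at its last colon
flat  <- unlist(tokens)
word  <- sub(":[^:]*$", "", flat)
count <- as.numeric(sub("^.*:", "", flat))

# Same long format as "out" above, then the same xtabs call
long <- data.frame(document = doc, word = word, count = count)
tdm <- xtabs(count ~ word + document, long)
tdm
```

Setting the factor levels explicitly also avoids the alphabetical column order (document1, document10, document2, ...) that plain character labels would give once there are ten or more documents.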
