r - Including ID numbers in dfm() output

Question

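I have a dataset with a column of ID numbers and a column of text, and I am running a LIWC analysis on the text data using the quanteda package. Here is an example of my data setup: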

mydata<-data.frame(
  id=c(19,101,43,12),
  text=c("No wonder, then, that ever gathering volume from the mere transit ",
         "So that in many cases such a panic did he finally strike, that few ",
         "But there were still other and more vital practical influences at work",
         "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors=FALSE
)

I have been able to run the LIWC analysis using scores <- dfm(as.character(mydata$text), dictionary = liwc)

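However, when I view the results (View(scores)), I find that the function does not reference the original ID numbers (19, 101, 43, 12) in the final output. Instead, a row.names column is included, but it contains non-descriptive identifiers (e.g., "text1", "text2").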

How can I get dfm() to include the ID numbers in its output? Thank you!

score 1 · Accepted Answer

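It sounds like you want the row names of the dfm object to be the ID numbers from your mydata$id. This happens automatically if you declare those IDs to be the document names (docnames) of the texts. The easiest way to do that is to create a quanteda corpus object from your data.frame.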

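The corpus() call below assigns the document names from your id variable. Note: the "Text" column in the summary() output looks like a numeric value, but it is actually the document name of each text.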

require(quanteda)
myCorpus <- corpus(mydata[["text"]], docnames = mydata[["id"]])
summary(myCorpus)
# Corpus consisting of 4 documents.
# 
# Text Types Tokens Sentences
#   19    11     11         1
#  101    13     14         1
#   43    12     12         1
#   12    12     14         1
# 
# Source:  /Users/kbenoit/Dropbox/GitHub/quanteda/* on x86_64 by kbenoit
# Created: Tue Dec 29 11:54:00 2015
# Notes:

From there, the document names automatically become the row labels in the dfm. (You can add the dictionary = argument for your LIWC application.)

myDfm <- dfm(myCorpus, verbose = FALSE)
head(myDfm)
# Document-feature matrix of: 4 documents, 45 features.
# (showing first 4 documents and first 6 features)
#      features
# docs  no wonder then that ever gathering
#   19   1      1    1    1    1         1
#   101  0      0    0    2    0         0
#   43   0      0    0    0    0         0
#   12   0      0    0    0    0         0
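
If you also need the IDs as a regular column after applying a dictionary, the following is a minimal sketch rather than a definitive recipe. It assumes a recent quanteda release, where, as far as I know, dfm() expects a tokens object, so the texts are tokenized first. Because the LIWC dictionary is not freely available, toyDict is a made-up stand-in with two hypothetical categories; scoresDf is likewise just an illustrative name.

```r
library(quanteda)

# Same example data as in the question
mydata <- data.frame(
  id = c(19, 101, 43, 12),
  text = c("No wonder, then, that ever gathering volume from the mere transit ",
           "So that in many cases such a panic did he finally strike, that few ",
           "But there were still other and more vital practical influences at work",
           "Not even at the present day has the original prestige of the Sperm Whale"),
  stringsAsFactors = FALSE
)

# Declare the IDs as document names (converted to character to be safe)
myCorpus <- corpus(mydata[["text"]], docnames = as.character(mydata[["id"]]))

# Hypothetical stand-in for the LIWC dictionary (assumption, not real LIWC)
toyDict <- dictionary(list(negation = c("no", "not"),
                           time     = c("ever", "still", "day")))

# Tokenize, build the dfm, then collapse features into dictionary categories
myDfm  <- dfm(tokens(myCorpus, remove_punct = TRUE))
scores <- dfm_lookup(myDfm, dictionary = toyDict)

# The document names survive the lookup and come back as the doc_id column
scoresDf <- convert(scores, to = "data.frame")
scoresDf$id <- as.numeric(scoresDf$doc_id)
print(scoresDf)
```

Swapping toyDict for your own liwc dictionary object should leave the rest unchanged, and the id column can then be used to merge the scores back onto mydata.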
