First, you can add document names to your corpus:
document_names <- c("doc1", "doc2", "doc3")
a_corpus <- quanteda::corpus(x = c("some corpus text of no consequence that in practice is going to be very large",
"and so one might expect a very large number of ngrams but for nlp purposes only care about top ten",
"adding some corpus text word repeats to ensure ngrams top ten selection approaches are working"),
docnames = document_names)
a_corpus
# Corpus consisting of 3 documents and 0 docvars.
Now you can use the document names in subsequent quanteda function calls.
ngrams_dfm <- quanteda::dfm(a_corpus, tolower = TRUE, stem = FALSE, ngrams = 2)
ngrams_dfm
# Document-feature matrix of: 3 documents, 43 features (63.6% sparse).
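(Note: in quanteda v3 and later, the `ngrams` argument to `dfm()` has been removed; ngrams are built through the tokens pipeline instead. A sketch of the equivalent call, reusing `a_corpus` from above:)

```r
library(quanteda)

# quanteda >= 3 equivalent: tokenize first, form bigrams, then build the dfm
toks <- tokens(a_corpus)
ngrams_dfm <- dfm(tokens_ngrams(toks, n = 2))
```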
You can also use the groups option of textstat_frequency
to get the document names in the frequency results:
freq <- quanteda::textstat_frequency(ngrams_dfm, groups = quanteda::docnames(ngrams_dfm))
head(freq)
#            feature frequency rank docfreq group
# 1      some_corpus         1    1       1  doc1
# 2      corpus_text         1    2       1  doc1
# 3          text_of         1    3       1  doc1
# 4            of_no         1    4       1  doc1
# 5   no_consequence         1    5       1  doc1
# 6 consequence_that         1    6       1  doc1
If you want to get the data out of ngrams_dfm into a data.frame, quanteda has the convert
function for that:
convert(ngrams_dfm, to = "data.frame")
#   document some_corpus corpus_text text_of of_no no_consequence consequence_that that_in in_practice practice_is is_going going_to to_be
# 1     doc1           1           1       1     1              1                1       1           1           1        1        1     1
# 2     doc2           0           0       0     0              0                0       0           0           0        0        0     0
# 3     doc3           1           1       0     0              0                0       0           0           0        0        0     0
You can then reshape it to get whatever you want; here is a dplyr/tidyr example:
library(dplyr)
convert(ngrams_dfm, to = "data.frame") %>%
  tidyr::gather(feature, frequency, -document) %>%
  group_by(document, feature) %>%
  summarise(frequency = sum(frequency))
# A tibble: 129 x 3
# Groups: document [?]
#    document feature          frequency
#    <chr>    <chr>                <dbl>
#  1 doc1     a_very                   0
#  2 doc1     about_top                0
#  3 doc1     adding_some              0
#  4 doc1     and_so                   0
#  5 doc1     approaches_are           0
#  6 doc1     are_working              0
#  7 doc1     be_very                  1
#  8 doc1     but_for                  0
#  9 doc1     care_about               0
# 10 doc1     consequence_that         1
# ... with 119 more rows
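(Side note: `tidyr::gather()` has since been superseded; on current tidyr the same reshape can be written with `pivot_longer()`. This sketch assumes the document-id column returned by `convert()` is named `document`, as in the output above; newer quanteda versions name it `doc_id` instead.)

```r
library(dplyr)

# Superseded gather() rewritten with pivot_longer(): one row per
# document/feature pair, with the bigram counts in `frequency`
convert(ngrams_dfm, to = "data.frame") %>%
  tidyr::pivot_longer(-document, names_to = "feature", values_to = "frequency")
```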
Or with data.table:
library(data.table)
out <- data.table(convert(ngrams_dfm, to = "data.frame"))
melt(out, id.vars = "document",
     variable.name = "feature", value.name = "freq")
#      document        feature freq
#   1:     doc1    some_corpus    1
#   2:     doc2    some_corpus    0
#   3:     doc3    some_corpus    1
#   4:     doc1    corpus_text    1
#   5:     doc2    corpus_text    0
#  ---
# 125:     doc2     care_about    1
# 126:     doc3     care_about    0
# 127:     doc1      about_top    0
# 128:     doc2      about_top    1
# 129:     doc3      about_top    0