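When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But, as in Chinese, English also has certain fixed phrases, such as "semantic distance" and "machine learning". If you split them into individual words, the meaning changes completely. I would like to know how to split a document into phrases rather than single words (terms).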
1 Answer
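You can do this in R with the quanteda package, which can detect multi-word expressions as statistical collocations. These are probably the multi-word expressions you have in mind for English. To exclude collocations that contain stop words, first tokenize the text, then remove the stop words while leaving a "pad" in their place. The pad prevents false adjacencies in the results, that is, two words that were not actually adjacent before the stop words between them were removed.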
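library(quanteda)
# with quanteda >= 3.0, also run library(quanteda.textstats), which now provides textstat_collocations()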
pres_tokens <-
tokens(data_corpus_inaugural) %>%
tokens_remove("\\p{P}", padding = TRUE, valuetype = "regex") %>%
tokens_remove(stopwords("english"), padding = TRUE)
pres_collocations <- textstat_collocations(pres_tokens, size = 2)
head(pres_collocations)
# collocation count count_nested length lambda z
# 1 united states 157 0 2 7.893307 41.19459
# 2 let us 97 0 2 6.291128 36.15520
# 3 fellow citizens 78 0 2 7.963336 32.93813
# 4 american people 40 0 2 4.426552 23.45052
# 5 years ago 26 0 2 7.896626 23.26935
# 6 federal government 32 0 2 5.312702 21.80328
# convert the corpus collocations into single tokens, for top 1,500 collocations
pres_compounded_tokens <- tokens_compound(pres_tokens, pres_collocations[1:1500, ])
tokens_select(pres_compounded_tokens[2], "*_*")
# tokens from 1 document.
# 1793-Washington :
# [1] "called_upon" "shall_endeavor" "high_sense" "official_act"
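To make the effect of the padding = TRUE arguments above more concrete, here is a minimal base R sketch. It does not use quanteda and is not how quanteda is implemented. It only mimics the idea on a toy sentence, and the word list, stop-word list, and the helper name adjacent_pairs are all made up for illustration.

```r
# Toy sentence and stop-word list (both hypothetical, for illustration only)
words <- c("state", "of", "the", "union", "address")
stops <- c("of", "the")

# Hypothetical helper: join each token to its right-hand neighbour,
# dropping any pair in which either side is an empty pad ("")
adjacent_pairs <- function(x) {
  if (length(x) < 2) return(character(0))
  left  <- x[-length(x)]
  right <- x[-1]
  keep  <- left != "" & right != ""
  paste(left[keep], right[keep], sep = "_")
}

# Without padding: the stop words vanish, so "state" and "union"
# look like neighbours even though they never were
no_pad <- words[!words %in% stops]
print(adjacent_pairs(no_pad))

# With padding: each stop word is replaced by "", which blocks the false pair
with_pad <- ifelse(words %in% stops, "", words)
print(adjacent_pairs(with_pad))
```

In this sketch only the second call avoids pairing "state" with "union", which is the kind of false adjacency the padded removal is meant to prevent before collocations are scored.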
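Using the compounded tokens in pres_compounded_tokens, we can now build a document-feature matrix whose features are a mix of the original terms (those not found in any collocation) and the collocations. As shown below, "united" occurs both on its own and as part of the collocation "united_states".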
pres_dfm <- dfm(pres_compounded_tokens)
head(pres_dfm[1:5, grep("united|states", featnames(pres_dfm))])
# Document-feature matrix of: 5 documents, 10 features (86% sparse).
# 5 x 10 sparse Matrix of class "dfm"
# features
# docs united states statesmen statesmanship reunited unitedly devastates statesman confederated_states united_action
# 1789-Washington 4 2 0 0 0 0 0 0 0 0
# 1793-Washington 1 0 0 0 0 0 0 0 0 0
# 1797-Adams 3 9 0 0 0 0 0 0 0 0
# 1801-Jefferson 0 0 0 0 0 0 0 0 0 0
# 1805-Jefferson 1 4 0 0 0 0 0 0 0 0
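If the phrases are already known in advance, as with the "machine learning" and "semantic distance" examples in the question, the collocation-detection step may not be needed. The sketch below assumes a reasonably recent quanteda in which tokens_compound() accepts a pattern wrapped in phrase(). The two example texts and the phrase list are invented for illustration and are not part of the inaugural corpus.

```r
library(quanteda)

# Hypothetical example texts and phrase list (assumptions, not from the corpus above)
txt <- c(d1 = "Machine learning can estimate semantic distance.",
         d2 = "Learning about a machine is not machine learning.")
known_phrases <- c("machine learning", "semantic distance")

toks <- tokens(txt, remove_punct = TRUE)

# phrase() tells quanteda to treat each whitespace-separated string
# as a sequence of tokens rather than as a single token pattern
toks_comp <- tokens_compound(toks, pattern = phrase(known_phrases))

# dfm() lowercases features by default, so "Machine_learning" and
# "machine_learning" end up in the same column
phrase_dfm <- dfm(toks_comp)
print(featnames(phrase_dfm))
```

In d2 the words "learning" and "machine" also occur on their own, so the resulting matrix should keep those as separate features alongside the compounded "machine_learning".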
If you prefer a more brute-force approach, you can simply create a document-by-bigram matrix like this:
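# just form all bigrams
# (the original call was dfm(<corpus>, ngrams = 2); newer quanteda releases build n-grams with tokens_ngrams() first)
head(dfm(tokens_ngrams(tokens(data_corpus_inaugural), n = 2)))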
## Document-feature matrix of: 57 documents, 63,866 features.
## (showing first 6 documents and first 6 features)
## features
## docs fellow-citizens_of of_the the_senate senate_and and_of the_house
## 1789-Washington 1 20 1 1 2 2
## 1797-Adams 0 29 0 0 2 0
## 1793-Washington 0 4 0 0 1 0
## 1801-Jefferson 0 28 0 0 3 0
## 1805-Jefferson 0 17 0 0 1 0
## 1809-Madison 0 20 0 0 2 0
Answered 2016-04-18T21:03:42.523