
When doing text mining in R, after preprocessing the text data we need to create a document-term matrix for further exploration. But, much as in Chinese, English also has certain phrases, such as "semantic distance" and "machine learning", whose meaning changes completely if they are split into individual words. I would like to know how to segment documents into phrases rather than single words.
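
For example, any plain word tokenizer splits these expressions apart; a minimal illustration in base R (the phrase is a made-up example):

# a whitespace tokenizer turns "machine learning" into two unrelated
# tokens, losing the meaning of the phrase
strsplit("semantic distance and machine learning", "\\s+")
# [[1]]
# [1] "semantic" "distance" "and"      "machine"  "learning"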


1 Answer


You can do this in R with the quanteda package, which can detect multi-word expressions as statistical collocations; these are most likely what you mean by multi-word expressions in English. To remove collocations containing stopwords, you first tokenize the text, then remove the stopwords while leaving a "pad" in their place. This prevents false adjacencies from showing up in the results (two words that were not actually adjacent before the stopwords between them were removed).

require(quanteda)

pres_tokens <- 
    tokens(data_corpus_inaugural) %>%
    tokens_remove("\\p{P}", padding = TRUE, valuetype = "regex") %>%  # drop punctuation, leave pads
    tokens_remove(stopwords("english"), padding = TRUE)               # drop stopwords, leave pads

pres_collocations <- textstat_collocations(pres_tokens, size = 2)

head(pres_collocations)
#          collocation count count_nested length   lambda        z
# 1      united states   157            0      2 7.893307 41.19459
# 2             let us    97            0      2 6.291128 36.15520
# 3    fellow citizens    78            0      2 7.963336 32.93813
# 4    american people    40            0      2 4.426552 23.45052
# 5          years ago    26            0      2 7.896626 23.26935
# 6 federal government    32            0      2 5.312702 21.80328
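
Since textstat_collocations() returns a lambda and z score for each collocation (see the output above), you could also filter by significance instead of taking a fixed top-N slice; a minimal sketch, where the z > 3 cutoff is an arbitrary choice:

# keep only collocations whose z score clears an (arbitrary) threshold;
# the textstat_collocations() result behaves like a data.frame
sig_collocations <- pres_collocations[pres_collocations$z > 3, ]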

# convert the corpus collocations into single tokens, for top 1,500 collocations
pres_compounded_tokens <- tokens_compound(pres_tokens, pres_collocations[1:1500])

tokens_select(pres_compounded_tokens[2], "*_*")
# tokens from 1 document.
# 1793-Washington :
# [1] "called_upon"    "shall_endeavor" "high_sense"     "official_act"  

Using this "compounded" set of tokens, we can now convert it into a document-feature matrix whose features consist of the original terms (those not found in a collocation) plus the collocations. As the example below shows, "united" appears both on its own and as part of the collocation "united_states".

pres_dfm <- dfm(pres_compounded_tokens)
head(pres_dfm[1:5, grep("united|states", featnames(pres_dfm))])
# Document-feature matrix of: 5 documents, 10 features (86% sparse).
# 5 x 10 sparse Matrix of class "dfm"
#                  features
# docs              united states statesmen statesmanship reunited unitedly devastates statesman confederated_states united_action
#   1789-Washington      4      2         0             0        0        0          0         0                   0             0
#   1793-Washington      1      0         0             0        0        0          0         0                   0             0
#   1797-Adams           3      9         0             0        0        0          0         0                   0             0
#   1801-Jefferson       0      0         0             0        0        0          0         0                   0             0
#   1805-Jefferson       1      4         0             0        0        0          0         0                   0             0
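
If you only want the compounded features themselves, dfm_select() with a glob pattern pulls them out directly; a small sketch on the pres_dfm built above:

# keep only features containing an underscore, i.e. the compounded collocations
dfm_select(pres_dfm, pattern = "*_*")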

If you want a more brute-force approach, you can simply build a document-by-bigram matrix this way:

# just form all bigrams
head(dfm(data_corpus_inaugural, ngrams = 2))
## Document-feature matrix of: 57 documents, 63,866 features.
## (showing first 6 documents and first 6 features)
##                  features
## docs              fellow-citizens_of of_the the_senate senate_and and_of the_house
##   1789-Washington                  1     20          1          1      2         2
##   1797-Adams                       0     29          0          0      2         0
##   1793-Washington                  0      4          0          0      1         0
##   1801-Jefferson                   0     28          0          0      3         0
##   1805-Jefferson                   0     17          0          0      1         0
##   1809-Madison                     0     20          0          0      2         0
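
Note that in more recent quanteda releases the ngrams argument to dfm() has been removed (and textstat_collocations() has moved to the quanteda.textstats package), so the brute-force version would instead go through tokens_ngrams(); a sketch under the current API:

library(quanteda)

# form the bigrams on the tokens first, then build the document-feature matrix
bigram_dfm <- tokens(data_corpus_inaugural) %>%
    tokens_ngrams(n = 2) %>%
    dfm()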
Answered on 2016-04-18T21:03:42.523