
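I am using the quanteda R package to extract n-grams (here unigrams and bigrams) from the text in Data_clean$Review, but I am looking for a way to compute, in R, the chi-squared statistic between the documents and the extracted n-grams.

Below is the R code I wrote to clean the text (the reviews) and generate the n-grams.

Any ideas?

Thank you.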

library(quanteda)  # tokens(), dfm()
library(magrittr)  # %>%

# delete rows where Note or Review is empty
Data_clean <- Data[Data$Note!="" & Data$Review!="",]


Data_clean$id <- seq.int(nrow(Data_clean))

train.index <- 1:50000
test.index <- 50001:nrow(Data_clean)


# clean up: replace punctuation and digits with spaces, then lowercase
Data_clean$Review.clean <- tolower(gsub('[[:punct:]0-9]', ' ', Data_clean$Review))


train <- Data_clean[train.index, ]
test <- Data_clean[test.index, ]

# use the cleaned column created above (Review.clean)
temp.tf <- Data_clean$Review.clean %>% tokens(ngrams = 1:2) %>% # generate unigram and bigram tokens
      dfm  # generate dfm

1 Answer


You would not use ngrams for this, but rather a function called textstat_collocations().

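It is a bit hard to follow your exact example, since none of those objects are explained or supplied, but let's try it with some of quanteda's built-in data. I'll take the texts from the inaugural corpus and apply some filters similar to the ones you have above.

So to score the bigrams by chi^2, you would use: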

# create the corpus, subset on some conditions (could be Note != "" for instance)
corp_example <- data_corpus_inaugural
corp_example <- corpus_subset(corp_example, Year > 1960)

# this will remove punctuation and numbers
toks_example <- tokens(corp_example, remove_punct = TRUE, remove_numbers = TRUE)

# find and score chi^2 bigrams
coll2 <- textstat_collocations(toks_example, method = "chi2", max_size = 2)
head(coll2, 10)
#             collocation count       X2
# 1       reverend clergy     2 28614.00
# 2       Majority Leader     2 28614.00
# 3       Information Age     2 28614.00
# 4      Founding Fathers     3 28614.00
# 5  distinguished guests     3 28614.00
# 6       Social Security     3 28614.00
# 7         Chief Justice     9 23409.82
# 8          middle class     4 22890.40
# 9       Abraham Lincoln     2 19075.33
# 10       society's ills     2 19075.33
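
To make the X2 column easier to interpret, here is a minimal base R sketch of a Pearson chi-squared score for a single candidate bigram, computed from a 2x2 table of adjacent word pairs. The token vector is toy data invented for illustration (it is not taken from the inaugural corpus), and quanteda's own computation may differ in details such as smoothing, so treat this as an explanation of the idea rather than a reproduction of textstat_collocations().

```r
# Toy token sequence (hypothetical data, for illustration only)
toks <- c("social", "security", "is", "social", "security",
          "not", "social", "policy", "is", "security")

# All adjacent word pairs: w1[i] is followed by w2[i]
w1 <- head(toks, -1)
w2 <- tail(toks, -1)

# 2x2 contingency table for the candidate bigram "social security"
tab <- table(first_is_social    = w1 == "social",
             second_is_security = w2 == "security")

# Pearson chi-squared without continuity correction
# (warnings about small expected counts are suppressed for this toy example)
res <- suppressWarnings(chisq.test(tab, correct = FALSE))
x2 <- unname(res$statistic)

# The same statistic from the closed-form 2x2 formula
a <- tab["TRUE", "TRUE"];  b <- tab["TRUE", "FALSE"]
c <- tab["FALSE", "TRUE"]; d <- tab["FALSE", "FALSE"]
n <- sum(tab)
x2_manual <- n * (a * d - b * c)^2 / ((a + b) * (c + d) * (a + c) * (b + d))

x2         # 2.25
x2_manual  # 2.25
```

A large value means the two words occur next to each other far more often than their individual frequencies would predict, which is why rare but fixed pairs such as "reverend clergy" rank so high in the output above.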

Added: to then compound the tokens using the top-scoring collocations:

# needs to be a list of the collocations as separate character elements
coll2a <- sapply(coll2$collocation, strsplit, " ", USE.NAMES = FALSE)

# compound the tokens using top 100 collocations
toks_example_comp <- tokens_compound(toks_example, coll2a[1:100])
toks_example_comp[[1]][1:20]
# [1] "Vice_President"  "Johnson"         "Mr_Speaker"      "Mr_Chief"        "Chief_Justice"  
# [6] "President"       "Eisenhower"      "Vice_President"  "Nixon"           "President"      
# [11] "Truman"          "reverend_clergy" "fellow_citizens" "we"              "observe"        
# [16] "today"           "not"             "a"               "victory"         "of"             
answered 2017-05-18T08:16:34.003