I want to keep two-letter acronyms delimited by periods, such as "t.v." and "u.s.", in my unigram frequency table. When I build the table with quanteda, the terminal period gets truncated. Here is a small test corpus to illustrate; I have removed periods as sentence delimiters:
SOS This is the u.s. where our politics is crazy EOS
SOS In the US we watch a lot of t.v. aka TV EOS
SOS TV is an important part of life in the US EOS
SOS folks outside the u.s. probably don't watch so much t.v. EOS
SOS living in other countries is probably not any less crazy EOS
SOS i enjoy my sanity when it comes to visit EOS
I load this into R as a character vector:
acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")
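To see exactly where the split happens, I ran quanteda's tokenizer on the first sentence by itself. This is a minimal check; I'm assuming the tokens() interface here, which newer quanteda versions provide (in the older versions tokenize() should be the equivalent):

library(quanteda)

# Tokenize one sentence, keeping punctuation, to see how "u.s." is handled
toks <- tokens(acro.test[1], remove_punct = FALSE)
toks
# The internal period survives ("u.s"), but the trailing period is split
# off into its own "." token, which matches the table further down.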
Here is the code I use to build my unigram frequency table:
library(quanteda)
# build the dfm, keeping case and punctuation (older-style dfm() arguments)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ", toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
# document frequency of each token, as a one-column data frame
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
# sort tokens by frequency, descending
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable
This produces the following result:
ngram frequency
1 SOS 6
2 EOS 6
3 the 4
4 is 3
5 . 3
6 u.s 2
7 crazy 2
8 US 2
9 watch 2
10 of 2
11 t.v 2
12 TV 2
13 in 2
14 probably 2
15 This 1
16 where 1
17 our 1
18 politics 1
19 In 1
20 we 1
21 a 1
22 lot 1
23 aka 1
ETC...
I'd like to preserve the terminal periods on t.v. and u.s., and to eliminate the "." entry (frequency 3) from the table.
I also don't understand why the period (.) is counted 3 times in this table when the u.s and t.v unigrams are counted correctly (2 each).
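For what it's worth, the closest workaround I've found is quanteda's whitespace tokenizer, which should leave u.s. and t.v. intact because it splits only on whitespace. This is a rough sketch, assuming a quanteda version where dfm() accepts a tokens object and what = "fasterword" behaves as documented; I'd still prefer a way to do this through the dfm() call itself:

# Split on whitespace only, so internal and trailing periods survive
toks.ws <- tokens(acro.test, what = "fasterword")
dfm.ws <- dfm(toks.ws, tolower = FALSE)
# Document frequencies should now include "u.s." and "t.v." with their periods
sort(docfreq(dfm.ws), decreasing = TRUE)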