
I'd like to preserve two-letter acronyms that use internal periods, such as "t.v." and "u.s.", in my unigram frequency table. When I build the table with quanteda, the terminating period gets truncated. Here is a small test corpus to illustrate; I have already removed the periods that acted as sentence delimiters:

SOS This is the u.s. where our politics is crazy EOS

SOS In the US we watch a lot of t.v. aka TV EOS

SOS TV is an important part of life in the US EOS

SOS folks outside the u.s. probably don't watch so much t.v. EOS

SOS living in other countries is probably not any less crazy EOS

SOS i enjoy my sanity when it comes to visit EOS

I load this into R as a character vector:

acro.test <- c("SOS This is the u.s. where our politics is crazy EOS", "SOS In the US we watch a lot of t.v. aka TV EOS", "SOS TV is an important part of life in the US EOS", "SOS folks outside the u.s. probably don't watch so much t.v. EOS", "SOS living in other countries is probably not any less crazy EOS", "SOS i enjoy my sanity when it comes to visit EOS")

Here is the code I use to build my unigram frequency table:

library(quanteda)
dat.dfm <- dfm(acro.test, ngrams=1, verbose=TRUE, concatenator=" ",  toLower=FALSE, removeNumbers=TRUE, removePunct=FALSE, stopwords=FALSE)
dat.mat <- as.data.frame(as.matrix(docfreq(dat.dfm)))
ng.sorted <- sort(rowSums(dat.mat), decreasing=TRUE)
freqTable <- data.frame(ngram=names(ng.sorted), frequency = ng.sorted)
row.names(freqTable) <- NULL
freqTable

This produces the following result:

       ngram frequency
1        SOS         6
2        EOS         6
3        the         4
4         is         3
5          .         3
6        u.s         2
7      crazy         2
8         US         2
9      watch         2
10        of         2
11       t.v         2
12        TV         2
13        in         2
14  probably         2
15      This         1
16     where         1
17       our         1
18  politics         1
19        In         1
20        we         1
21         a         1
22       lot         1
23       aka         1

ETC...

I'd like to keep the terminal periods on t.v. and u.s. and eliminate the . entry in the table with its frequency of 3.

I also don't understand why the period (.) counts as 3 in this table when the u.s and t.v unigrams are counted correctly (2 each).


1 Answer


The reason for this behaviour is that quanteda's default word tokenizer uses the ICU-based definition of word boundaries (from the stringi package). u.s. comes out as the token u.s followed by a separate period token . (great if your name is will.i.am, but perhaps not so great for your purposes). You can easily switch to the whitespace tokenizer, however, by passing the argument what = "fasterword" to tokens(); this option is available in dfm() through the ... part of the function call.

tokens(acro.test, what = "fasterword")[[1]]
## [1] "SOS"      "This"     "is"       "the"      "u.s."     "where"    "our"      "politics" "is"       "crazy"    "EOS" 

You can see here that u.s. is preserved. As for your last question, the terminal . has a document frequency of 3 because it appears as a separate token in three documents, which is the default tokenizer's behaviour when remove_punct = FALSE.
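
As a quick check (a sketch reusing the objects above; the exact printed layout may differ across quanteda versions), both effects are visible with the default tokenizer:

# default ICU tokenizer: the trailing period is split off as its own token
tokens(acro.test)[[1]]
## [1] "SOS"      "This"     "is"       "the"      "u.s"      "."
## [7] "where"    "our"      "politics" "is"       "crazy"    "EOS"

# ...so "." has a document frequency of 3: documents 1, 2, and 4 are the
# ones that contain at least one u.s. or t.v.
docfreq(dfm(acro.test, tolower = FALSE, remove_punct = FALSE))["."]
## . 
## 3 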

To pass this through to dfm() and then construct a data.frame of the words' document frequencies, the following code works (I have tidied it up a bit for efficiency). Note the comment about the difference between document and term frequency; I have noticed that some users are confused by docfreq().

# I removed the options that were the same as the default 
# note also that stopwords = FALSE is not a valid argument - see the remove parameter
dat.dfm <- dfm(acro.test, tolower = FALSE, remove_punct = FALSE, what = "fasterword")

# sort in descending document frequency
dat.dfm <- dat.dfm[, names(sort(docfreq(dat.dfm), decreasing = TRUE))]
# Note: this would sort the dfm in descending total term frequency
#       not the same as docfreq
# dat.dfm <- sort(dat.dfm)

# this creates the data.frame in one more efficient step
freqTable <- data.frame(ngram = featnames(dat.dfm), frequency = docfreq(dat.dfm),
                        row.names = NULL, stringsAsFactors = FALSE)
head(freqTable, 10)
##    ngram frequency
## 1    SOS         6
## 2    EOS         6
## 3    the         4
## 4     is         3
## 5   u.s.         2
## 6  crazy         2
## 7     US         2
## 8  watch         2
## 9     of         2
## 10  t.v.         2
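
To make that distinction concrete (a sketch reusing dat.dfm from above): "is" occurs four times in total but in only three of the six documents, so the two measures disagree:

# total term frequency: every occurrence counts ("is" appears twice in document 1)
colSums(dat.dfm)["is"]
## is 
##  4 
# document frequency: each document counts at most once
docfreq(dat.dfm)["is"]
## is 
##  3 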

In my view, the named vector produced by calling docfreq() on the dfm is a more efficient way to store the results than the data.frame approach, but you may wish to add other variables to it.
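
For instance (a sketch), the vector already maps each feature name to its document frequency and can be indexed directly:

df.vec <- docfreq(dat.dfm)  # named numeric vector, feature names as names
head(df.vec, 4)
## SOS EOS the  is 
##   6   6   4   3 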
