I want to analyze a large corpus (n = 500,000 documents). I am using quanteda, which I expect to be faster than tm_map() from tm. I would like to proceed step by step rather than using the automated dfm(). I have reasons for this: in one case, I don't want to tokenize before removing stopwords, since that would produce many useless bigrams; in another case, I have to preprocess the texts with language-specific procedures.
I would like to implement this sequence:
1) remove punctuation and numbers
2) remove stopwords (i.e. before tokenization, to avoid useless tokens)
3) tokenize into unigrams and bigrams
4) create the dfm
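For reference, here is a sketch of the whole sequence against the quanteda 0.9.8 API used in this post, with the caveat that the stopwords end up being dropped at the token level rather than from the raw text. The function names (tokenize, removeFeatures, ngrams, trim) are the ones this version exposes, and the example assumes the built-in ie2010Corpus:

```r
library(quanteda)  # version 0.9.8 API assumed throughout

# 1) tokenize, dropping punctuation and numbers
toks <- tokenize(ie2010Corpus, removePunct = TRUE, removeNumbers = TRUE)
# 2) remove stopwords at the token level (raw-text removal is not supported)
toks <- removeFeatures(toks, c("a", "the", "all", "some"))
# 3) form unigrams and bigrams
toks <- ngrams(toks, 1:2)
# 4) create the dfm, then trim rare features
myDfm <- dfm(toks)
myDfm <- trim(myDfm, minCount = 5)
```

Note that this order (tokenize first, remove stopword tokens second) is the reverse of steps 2 and 3 above; whether that is acceptable depends on the language-specific issues discussed further down.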
My attempt:
> library(quanteda)
> packageVersion("quanteda")
[1] ‘0.9.8’
> text <- ie2010Corpus$documents$texts
> text.corpus <- quanteda:::corpus(text, docnames=rownames(ie2010Corpus$documents))
> class(text.corpus)
[1] "corpus" "list"
> stopw <- c("a","the", "all", "some")
> TextNoStop <- removeFeatures(text.corpus, features = stopw)
# Error in UseMethod("selectFeatures") :
# no applicable method for 'selectFeatures' applied to an object of class "c('corpus', 'list')"
# This is how I would theoretically continue:
> token <- tokenize(TextNoStop, removePunct=TRUE, removeNumbers=TRUE)
> token2 <- ngrams(token,c(1,2))
Bonus question
How do I remove sparse tokens in quanteda? (i.e., the equivalent of removeSparseTerms() in tm.)
UPDATE
In light of @Ken's answer, here is the code to proceed step by step with quanteda:
library(quanteda)
packageVersion("quanteda")
[1] ‘0.9.8’
1) Remove custom punctuation and numbers. For example, note the "\n" in the ie2010 corpus:
text.corpus <- ie2010Corpus
texts(text.corpus)[1] # Use texts() to extract the texts
# 2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery.\nIt is
texts(text.corpus)[1] <- gsub("\\s", " ", text.corpus[1]) # replace all whitespace (incl. \n, \t, \r) with single spaces
texts(text.corpus)[1]
2010_BUDGET_01_Brian_Lenihan_FF
# "When I presented the supplementary budget to this House last April, I said we
# could work our way through this period of severe economic distress. Today, I
# can report that notwithstanding the difficulties of the past eight months, we
# are now on the road to economic recovery. It is of e
A further note on why one may prefer preprocessing: my current corpus is in Italian, a language in which articles are attached to words with an apostrophe. Thus, a straight dfm() can lead to imprecise tokenization. For example:
broken.tokens <- dfm(corpus(c("L'abile presidente Renzi. Un'abile mossa di Berlusconi"), removePunct=TRUE))
will produce two separate tokens for the same word ("un'abile" and "l'abile"), hence the need for an additional gsub() step here.
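One way to handle this before tokenization is a gsub() pass over the raw texts. The pattern below is only an illustrative sketch covering the two elided articles from the example, not a general rule for Italian:

```r
# Strip elided articles such as "l'" / "un'" (and their capitalized forms)
# so that "L'abile" and "Un'abile" both tokenize to "abile".
# The regex is an assumption for illustration only.
txt <- "L'abile presidente Renzi. Un'abile mossa di Berlusconi"
txt <- gsub("\\b[LlUu]n?'", "", txt)
clean.tokens <- dfm(corpus(txt), removePunct = TRUE)
```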
2) In quanteda it is not possible to remove stopwords directly from the text before tokenization. In my previous example, "l" and "un" have to be removed so as not to produce misleading bigrams. This can be handled in tm with tm_map(..., removeWords).
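For comparison, the tm route mentioned above looks roughly like this (a sketch; removeWords matches whole words case-sensitively, so both lower- and upper-case variants may need to be listed):

```r
library(tm)

# Remove the elided articles from the raw text before any tokenization.
corp <- VCorpus(VectorSource("l'abile presidente Renzi. un'abile mossa"))
corp <- tm_map(corp, removeWords, c("l", "un"))
```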
3) Tokenization:
token <- tokenize(text.corpus[1], removePunct=TRUE, removeNumbers=TRUE, ngrams = 1:2)
4) Create the dfm:
dfm <- dfm(token)
5) Remove sparse features:
dfm <- trim(dfm, minCount = 5)
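If the goal is a document-frequency cutoff closer in spirit to tm's removeSparseTerms() than a raw total count, trim() can be given a minDoc argument as well (I am assuming this argument exists in version 0.9.8; check ?trim):

```r
# Keep only features occurring at least 5 times overall AND in at
# least 2 documents (minDoc is an assumed argument; see ?trim).
dfm <- trim(dfm, minCount = 5, minDoc = 2)
```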