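You can do this by creating a dfm, stemming its features, and then recompiling the dfm so that features that become identical after stemming are combined.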
require(quanteda)
txt <- c("creatief creatieve creatie")
(dfm1 <- dfm(txt))
## Document-feature matrix of: 1 document, 3 features (0% sparse).
## 1 x 3 sparse Matrix of class "dfmSparse"
## features
## docs creatief creatieve creatie
## text1 1 1 1
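The step below approximates your example by truncating each feature name to its first seven characters. You can replace the string-subsetting call on the right-hand side with your own stemming operation on the character vector of features.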
# this approximates what you can do with the Python-based stemmer
# note that here you must use colnames<- since there is no function
# featnames<- (for replacement)
colnames(dfm1) <- stringi::stri_sub(featnames(dfm1), 1, 7)
dfm1
## Document-feature matrix of: 1 document, 3 features (0% sparse).
## 1 x 3 sparse Matrix of class "dfmSparse"
## features
## docs creatie creatie creatie
## text1 1 1 1
Then you can recompile the dfm to combine the counts.
# this combines counts in featnames that are identical
dfm_compress(dfm1)
## Document-feature matrix of: 1 document, 1 feature (0% sparse).
## 1 x 1 sparse Matrix of class "dfmSparse"
## features
## docs creatie
## text1 3
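To make the "combine identical features" step concrete, here is a minimal base-R sketch that mimics it on a plain count matrix instead of a dfm. It is only an illustration of the idea, not how quanteda implements dfm_compress(); the names m, stem_fun, and compressed are hypothetical, and stem_fun is a stand-in for whatever stemmer you actually use.

```r
# Toy count matrix standing in for a dfm: 1 document, 3 features
m <- matrix(c(1, 1, 1), nrow = 1,
            dimnames = list("text1", c("creatief", "creatieve", "creatie")))

# Hypothetical stand-in for a real stemmer: keep the first 7 characters
stem_fun <- function(x) substr(x, 1, 7)

# Rename the features; the column names are now duplicated
colnames(m) <- stem_fun(colnames(m))

# rowsum() adds up rows that share a group label, so transpose,
# sum by feature name, and transpose back to documents x features
compressed <- t(rowsum(t(m), group = colnames(m)))
compressed
```

The result has a single column, creatie, holding the summed count for text1.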
Note that if you use quanteda's own stemmer, this step can simply be dfm_wordstem():
dfm_wordstem(dfm1)
## Document-feature matrix of: 1 document, 1 feature (0% sparse).
## 1 x 1 sparse Matrix of class "dfmSparse"
## features
## docs creati
## text1 3
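If you are on a more recent quanteda release, the rename-and-compress steps can probably be done in one call with dfm_replace(), which maps each feature to a replacement and merges features that end up identical. The sketch below assumes a quanteda version in which dfm() is built from tokens() and dfm_replace() is available; my_stem is a hypothetical placeholder for your own stemmer.

```r
library(quanteda)

txt <- c("creatief creatieve creatie")
dfm1 <- dfm(tokens(txt))

# Hypothetical placeholder for an external stemmer:
# here it just keeps the first 7 characters of each feature
my_stem <- function(x) substr(x, 1, 7)

feat <- featnames(dfm1)

# Map every original feature to its stem; features that share
# a stem are combined and their counts summed
dfm2 <- dfm_replace(dfm1, pattern = feat, replacement = my_stem(feat),
                    case_insensitive = FALSE)
dfm2
```

This avoids assigning to colnames() directly, which may be preferable if you want to stay within quanteda's documented functions.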