I am trying to extract the dictionary words that were matched in a quanteda dfm, but I have not been able to find a way to do it.

Does anyone have a solution for this?

Sample input:

dict <- dictionary(list(season = c("spring", "summer", "fall", "winter")))
dfm  <- dfm("summer is great", dictionary  = dict)

Output:

 > dfm
 Document-feature matrix of: 1 document, 1 feature.
 1 x 1 sparse Matrix of class "dfmSparse"

   features
docs    season
text1      1

I now know that a word from the season dictionary was found in the sentence, but I would also like to know which word it was.

Ideally, this would be extracted in a table format:

docs    dict     dictWord
text1   season   summer

1 Answer

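You can create a second dfm using the select argument (called keptFeatures in older versions of quanteda), and then cbind() it to the first, dictionary-based dfm.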

dict <- dictionary(list(season = c("spring", "summer", "fall", "winter")))
txt <- "summer is great"
season_dfm  <- dfm(txt, dictionary  = dict, verbose = FALSE)
dict_dfm <- dfm(txt, select = dict, verbose = FALSE)

combined_dfm <- cbind(season_dfm, dict_dfm)
combined_dfm
## Document-feature matrix of: 1 document, 2 features.
## 1 x 2 sparse Matrix of class "dfmSparse"
##       season summer
## text1      1      1

If you want the output as a table, it would be:

dict_df <- as.data.frame(combined_dfm)
names(dict_df)[2] <- "dictWord"
dict_df
##       season dictWord
## text1      1        1

But this only works if each text contains exactly one dictionary value; otherwise dict_dfm will have multiple feature columns.
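For texts with several matches, a long table with one row per matched word may be easier to work with. The following is a minimal base R sketch of that idea, not a quanteda feature: `match_dictionary()`, `dict_list`, and the second sample text are hypothetical names and data introduced only for illustration. It assumes plain whitespace tokenization and exact lowercase matching, so it ignores punctuation, glob patterns, and multi-word dictionary entries.

```r
# Hypothetical helper (not part of quanteda): one row per matched dictionary word.
dict_list <- list(season = c("spring", "summer", "fall", "winter"))
txts <- c(text1 = "summer is great", text2 = "winter follows fall")

match_dictionary <- function(texts, dict_list) {
  rows <- list()
  for (doc in names(texts)) {
    # Naive tokenization (assumption): lowercase, then split on whitespace
    toks <- strsplit(tolower(texts[[doc]]), "\\s+")[[1]]
    for (key in names(dict_list)) {
      # Keep the tokens that appear in this dictionary key's word list
      hits <- toks[toks %in% dict_list[[key]]]
      if (length(hits) > 0) {
        rows[[length(rows) + 1]] <- data.frame(
          docs = doc, dict = key, dictWord = hits,
          stringsAsFactors = FALSE
        )
      }
    }
  }
  # Stack the per-document, per-key pieces into one data frame
  do.call(rbind, rows)
}

result <- match_dictionary(txts, dict_list)
result
```

The result has the columns docs, dict, and dictWord requested in the question, and a text with two matches simply contributes two rows.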

Answered 2016-09-29T12:30:20.213