
I am using the Quanteda suite of packages to preprocess some text data. I want to compound collocations into features and decided to use the textstat_collocations function. According to the documentation, I quote:

tokens object .... While identifying collocations for tokens objects is supported, you will get better results with character or corpus objects due to relatively imperfect detection of sentence boundaries from texts already tokenized.

This makes sense, so here goes:

library(dplyr)
library(tibble)
library(quanteda)
library(quanteda.textstats)

# Some sample data and lemmas
df <- c("this column has a lot of missing data, 50% almost!",
        "I am interested in missing data problems",
        "missing data is a headache",
        "how do you handle missing data?")

lemmas <- data.frame() %>%
    rbind(c("missing", "miss")) %>%
    rbind(c("data", "datum")) %>%
    `colnames<-`(c("inflected_form", "lemma"))

(1) Generate collocations using the corpus object:

txtCorpus = corpus(df)
docvars(txtCorpus)$text <- as.character(txtCorpus)
myPhrases = textstat_collocations(txtCorpus, tolower = FALSE)
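Before compounding, it can help to keep only well-supported collocations. A minimal sketch, assuming the default textstat_collocations() output columns (collocation, count, z); the cutoff values are arbitrary illustrations, not from the original post:

```r
# A sketch: filter the scored collocations before compounding.
library(quanteda)
library(quanteda.textstats)

txtCorpus <- corpus(c("missing data is common",
                      "missing data is a pain",
                      "we all fear missing data"))
myPhrases <- textstat_collocations(txtCorpus, tolower = FALSE)

# keep collocations seen at least twice with a positive association score
strongPhrases <- subset(myPhrases, count >= 2 & z > 0)
strongPhrases$collocation
```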

(2) Pre-process the text, identify collocations, and lemmatise for downstream tasks.

# I used a blank space as the concatenator and the phrase() function as explained
# in the documentation, following the multi-multi substitution example at
# https://quanteda.io/reference/tokens_replace.html
txtTokens = tokens(txtCorpus, remove_numbers = TRUE, remove_punct = TRUE, 
                               remove_symbols = TRUE, remove_separators = TRUE) %>%
    tokens_tolower() %>%
    tokens_compound(pattern = phrase(myPhrases$collocation), concatenator = " ") %>%
    tokens_replace(pattern=phrase(c(lemmas$inflected_form)), replacement=phrase(c(lemmas$lemma)))
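At this point it is worth inspecting the compounded tokens: with concatenator = " ", the collocation becomes a single token that itself contains a space. A minimal sketch on one sentence:

```r
library(quanteda)
library(dplyr)

toks <- tokens("missing data is a headache", remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_compound(pattern = phrase("missing data"), concatenator = " ")

as.list(toks)
# the first token is the single string "missing data"
```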

(3) Test the results

# Create dtm
dtm = dfm(txtTokens, remove_padding = TRUE)

# pull features
dfm_feat = as.data.frame(featfreq(dtm)) %>%
    rownames_to_column(var="feature") %>%
    `colnames<-`(c("feature", "count"))

dfm_feat
        feature count
1          this     1
2        column     1
3           has     1
4             a     2
5           lot     1
6            of     1
7        almost     1
8             i     1
9            am     1
10   interested     1
11           in     1
12     problems     1
13           is     1
14     headache     1
15          how     1
16           do     1
17          you     1
18       handle     1
19 missing data     4

缺失数据”应该是“缺失数据”。

This only works when every document in df is a single word. I can make this process work if I use a tokens object to generate my collocations from the beginning, but that isn't what I want.


1 Answer


The problem is that you have compounded the elements of the collocation into a single "token" that contains a space, but by supplying the phrase() wrapper in tokens_compound(), you are telling tokens_replace() to look for two sequential tokens, not the single token containing a space.

The way to get what you want is to make the lemmatised replacement match the collocation.

phrase_lemmas <- data.frame(
  inflected_form = "missing data",
  lemma = "miss datum"
)
tokens_replace(txtTokens, phrase_lemmas$inflected_form, phrase_lemmas$lemma)
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
## [1] "this"       "column"     "has"        "a"          "lot"       
## [6] "of"         "miss datum" "almost"    
## 
## text2 :
## [1] "i"          "am"         "interested" "in"         "miss datum"
## [6] "problems"  
## 
## text3 :
## [1] "miss datum" "is"         "a"          "headache"  
## 
## text4 :
## [1] "how"        "do"         "you"        "handle"     "miss datum"
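If there are many collocations, the phrase_lemmas table could be built from the original single-word lemmas table rather than typed by hand. A base-R sketch; the helper lemmatise_phrase() is hypothetical (not part of quanteda) and assumes every word of the collocation that needs lemmatising appears in lemmas$inflected_form:

```r
# Build space-joined replacements from the single-word lemma table.
lemmas <- data.frame(
  inflected_form = c("missing", "data"),
  lemma = c("miss", "datum")
)

lemmatise_phrase <- function(x, lemmas) {
  words <- strsplit(x, " ")[[1]]
  idx <- match(words, lemmas$inflected_form)
  # keep the original word when it has no lemma entry
  words[!is.na(idx)] <- lemmas$lemma[idx[!is.na(idx)]]
  paste(words, collapse = " ")
}

phrase_lemmas <- data.frame(
  inflected_form = "missing data",
  lemma = unname(sapply("missing data", lemmatise_phrase, lemmas = lemmas))
)
phrase_lemmas$lemma
# "miss datum"
```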

Alternatives would be to use tokens_lookup() on uncompounded tokens directly, if you have a fixed listing of sequences you want to match to lemmatised sequences. E.g.,

tokens(txtCorpus) %>%
  tokens_lookup(dictionary(list("miss datum" = "missing data")),
    exclusive = FALSE, capkeys = FALSE
  )
## Tokens consisting of 4 documents and 1 docvar.
## text1 :
##  [1] "this"       "column"     "has"        "a"          "lot"       
##  [6] "of"         "miss datum" ","          "50"         "%"         
## [11] "almost"     "!"         
## 
## text2 :
## [1] "I"          "am"         "interested" "in"         "miss datum"
## [6] "problems"  
## 
## text3 :
## [1] "miss datum" "is"         "a"          "headache"  
## 
## text4 :
## [1] "how"        "do"         "you"        "handle"     "miss datum"
## [6] "?"
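A further variation, not from the answer but a common convention: compound with the default underscore concatenator instead of a space, so the compound is an ordinary single token and tokens_replace() needs no phrase() handling at all:

```r
library(quanteda)
library(dplyr)

toks <- tokens("how do you handle missing data?", remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_compound(pattern = phrase("missing data"), concatenator = "_") %>%
  tokens_replace(pattern = "missing_data", replacement = "miss_datum")

as.list(toks)
# text1: "how" "do" "you" "handle" "miss_datum"
```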
answered 2021-09-04T09:21:46.017