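I am doing text mining in R with the kwic function from quanteda, and I have run into a "problem" I would like to solve:

result <- kwic(corp2, phrase("trous oblongs"))

where corp2 is the corpus. "trous oblongs" is French and is in the plural. When I run this, however, I only get matches for the plural expression. I would also like to capture occurrences of the singular form "trou oblong" (and vice versa: if I originally enter "trou oblong" in the code, I would also like to get the plural forms).

I know that the udpipe package, thanks to its udpipe_annotate function (https://www.rdocumentation.org/packages/udpipe/versions/0.3/topics/udpipe_annotate), can extract the lemmas of the words in a text.

So I would like to know whether udpipe has a function that finds all occurrences of words sharing the same lemma in a corpus, or whether this can be done with kwic.

Thanks in advance.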

2 Answers

quanteda's tokens_wordstem() uses the stemmer from SnowballC:

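library(quanteda)

# Tokenize the corpus, then reduce every token to its French Snowball stem
toks <- tokens(corp2)
toks_stem <- tokens_wordstem(toks, "french")
# Search the stemmed tokens; the pattern itself must be written in stemmed form
kwic(toks_stem, phrase("trous oblong"))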

Alternatively, you can use the * wildcard to search on the stem:

toks <- tokens(corp2)
kwic(toks, phrase("trou* oblong*"))
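
As a quick check of what the wildcard pattern picks up, here is a minimal sketch on a toy corpus. The three sentences and the object names are made up for illustration (they are not from corp2), and it assumes a recent quanteda in which kwic() is called on a tokens object. Note that a glob such as trou* also matches unrelated words that merely start with the same letters; listing the two forms explicitly avoids that.

```r
library(quanteda)

# Toy sentences, invented for illustration only
txt <- c(d1 = "Le trou oblong facilite le reglage.",
         d2 = "Les trous oblongs permettent un alignement precis.",
         d3 = "Le troupeau oblong avance.")
toks <- tokens(txt, remove_punct = TRUE)

# Glob pattern: a token starting with "trou" followed by a token starting with "oblong"
res_glob <- kwic(toks, pattern = phrase("trou* oblong*"), window = 3)
res_glob$keyword

# Stricter alternative: list the singular and plural forms explicitly
res_exact <- kwic(toks, pattern = phrase(c("trou oblong", "trous oblongs")), window = 3)
res_exact$keyword
```

Comparing the two results shows the trade-off: the glob is shorter to write but also returns the "troupeau" sentence, while the explicit list returns only the two intended forms.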
answered 2018-04-07T17:57:37.810

If you want to stay within the udpipe framework, you can either use txt_nextgram together with txt_recode_ngram, or use the dependency parsing results if your two terms do not follow one another but you still want to find them.

library(udpipe)
library(data.table)
txt <- c("Les trous sont oblongs.", 
         "Les trous oblongs du systeme de montage des deux parties permettent un reglage facile et un alignement precis.")

## Annotate with udpipe to tokenise, obtain pos tags, lemmas, dependency parsing output
udmodel <- udpipe_download_model("french-sequoia", udpipe_model_repo = "bnosac/udpipe.models.ud")
udmodel <- udpipe_load_model(udmodel$file_model)
x <- udpipe_annotate(udmodel, txt)
x <- as.data.table(x)

## Situation 1: words are following one another
x <- x[, lemma_bigram := txt_nextgram(lemma, n = 2, sep = " "), by = list(doc_id, paragraph_id, sentence_id)]
subset(x, lemma_bigram %in% c("trous oblong"))

## Situation 2: words are not following one another - use dependency parsing results
x <- merge(x, 
           x[, c("doc_id", "paragraph_id", "sentence_id", "token_id", "token", "lemma", "upos", "xpos"), with = FALSE], 
           by.x = c("doc_id", "paragraph_id", "sentence_id", "head_token_id"),
           by.y = c("doc_id", "paragraph_id", "sentence_id", "token_id"),
           all.x = TRUE, all.y = FALSE, 
           suffixes = c("", "_parent"),
           sort = FALSE)
subset(x, lemma_bigram %in% c("trous oblong") | (lemma %in% "trous" & lemma_parent %in% "oblong"))

If you want to recode the keyword into a single term (this only covers situation 1):

x$term <- txt_recode_ngram(x$lemma, compound = "trous oblong", ngram = 2, sep = " ")
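
To see what the two helper functions do without downloading a model, the sketch below runs them on a hand-written vector that stands in for the lemma column of one sentence. The values follow the lemma spelling used in the code above and are an assumption for illustration, not real udpipe output.

```r
library(udpipe)

# Hand-written stand-in for x$lemma of one sentence (assumed values, not model output)
lemmas <- c("le", "trous", "oblong", "du", "systeme")

# Pair each lemma with the one that follows it; the last position has no successor
bigrams <- txt_nextgram(lemmas, n = 2, sep = " ")
hit <- which(bigrams %in% "trous oblong")  # position where the bigram starts
hit

# Collapse the matching bigram into a single term; the vector keeps its length
recoded <- txt_recode_ngram(lemmas, compound = "trous oblong", ngram = 2, sep = " ")
recoded
```

Because both functions return a vector of the same length as their input, the results can be stored as new columns next to the tokens, which is what the data.table code above relies on.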
answered 2018-04-07T21:01:01.123