It looks like udpipe makes more sense than kwic for "context". If sentence level, lemmas, and restricting by word type are enough, it should be fairly straightforward. Udpipe also has a prebuilt Dutch model available.
#install.packages("udpipe")
library(udpipe)
#dl <- udpipe_download_model(language = "english")
# Check the name on download result
udmodel_en <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")
# Single and multisentence samples
txt <- c("Is this possible, and how? A normal language example: In
my network, I contact a lot of people through Facebook -> I would like to get co-occurrence of
network and contact (a verb) I found most of my clients through my network")
txtb <- c("I found most of my clients through my network")
x <- udpipe_annotate(udmodel_en, x = txt)
x <- as.data.frame(x)
xb <- udpipe_annotate(udmodel_en, x = txtb)
xb <- as.data.frame(xb)
# Raw preview
table(x$sentence[x$lemma == 'network'])
# Use x or xb here
xn <- udpipe_annotate(udmodel_en, x = x$sentence[x$lemma == 'network'])
xdf <- as.data.frame(xn)
# Reduce noise and group by sentence ~ doc_id to table
df_view <- subset(xdf, upos %in% c('PRON','NOUN','VERB','PROPN'))
library(tidyverse)
df_view %>% group_by(doc_id) %>%
summarize(lemma = paste(sort(unique(lemma)),collapse=", "))
In a quick test, the prebuilt model treated network and networking as separate root lemmas, so some rough stemmer might work better. I did, however, verify that including networking in a sentence creates new matches.
I found most of my clients through my network
1
I would like to get co-occurrence of network and contact (a verb)
1
In my network, I contact a lot of people through Facebook ->
1
# A tibble: 3 × 2
doc_id lemma
<chr> <chr>
doc1 contact, Facebook, I, lot, my, network, people
doc2 co-occurrence, contact, get, I, like, network, verb
doc3 client, find, I, my, network
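As a minimal sketch of the stemming idea, a Porter stemmer (here via the SnowballC package, which is an assumption; any rough stemmer would do) collapses "network" and "networking" to a shared root, so matching on stems instead of udpipe's lemmas would merge those hits:

```r
# Sketch: collapse related word forms with a Porter stemmer (SnowballC).
# The lemma values mirror what udpipe put in xdf$lemma above.
library(SnowballC)

lemmas <- c("network", "networking", "contact", "client")
stems <- wordStem(lemmas, language = "en")
stems
# "network" and "networking" now share the same stem, so a match on
# stems would treat them as one term
```

In practice you would add `xdf$stem <- wordStem(xdf$lemma, language = "en")` and filter on `stem` rather than `lemma`.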
It is entirely possible to get the preceding and following words as context by moving up and down from the matched lemma's index, but that feels closer to what kwic already does. I did not include dynamic co-occurrence tabulation and sorting, but I imagine that should be a fairly trivial step once the context words are extracted. It will probably need some stopwords etc., but those should become more obvious with larger data.
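The "move up and down from the matched index" idea can be sketched like this: group the annotated tokens per sentence, find the rows whose lemma matches, and slice a ±n window by row position. The function name and the toy data frame are illustrative; with real data you would pass `xdf` from the annotation above, which has the same columns.

```r
# Sketch: pull a +/- n token window around each hit of a target lemma,
# working per sentence so windows never cross sentence boundaries.
# Assumes a udpipe-style data frame with doc_id, sentence_id, token, lemma.
context_window <- function(df, target_lemma, n = 2) {
  out <- list()
  for (s in split(df, interaction(df$doc_id, df$sentence_id), drop = TRUE)) {
    hits <- which(s$lemma == target_lemma)
    for (h in hits) {
      idx <- max(1, h - n):min(nrow(s), h + n)
      out[[length(out) + 1]] <- s$token[idx]
    }
  }
  out
}

# Toy demo; with real data, call context_window(xdf, "network")
toy <- data.frame(doc_id = "d1", sentence_id = 1,
                  token = c("In", "my", "network", "I", "contact", "people"),
                  lemma = c("in", "my", "network", "I", "contact", "people"),
                  stringsAsFactors = FALSE)
context_window(toy, "network", n = 2)
```

From there, tabulating co-occurrences is just `table(unlist(...))` over the windows, minus the target lemma and any stopwords.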