r - How to include stop words (terms) in text2vec

Question

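In the text2vec package, I am using the create_vocabulary function. For example, my text is "This book is very good", and suppose I am not using stop words and I use n-grams from 1L to 3L. The vocabulary terms will then be

This, book, is, very, good, This book, ..., book is very, very good. I only want to remove the term "book is very" (along with many other terms supplied as a vector). Because I only want to remove a phrase, I can't use stop words. I wrote the following code: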

x <- read.csv('Filename')  # file containing all the stop phrases
stp <- as.vector(x$term)   # stp holds the stop phrases

vocab <- create_vocabulary(it, ngram = c(1L, 3L))
vocab_mod <- subset(vocab, !(term %in% stp))

When I run the steps above, the meta information stored in the attributes is lost in vocab_mod, so it cannot be used in create_dtm.

score 1 · Accepted Answer

The subset function seems to drop some attributes. You can try:

library(text2vec)
txt = "This book is very good"
it = itoken(txt)
v = create_vocabulary(it, ngram = c(1, 3))
v = v[!(v$term %in% "is_very_good"), ]    
v
# Number of docs: 1 
# 0 stopwords:  ... 
# ngram_min = 1; ngram_max = 3 
# Vocabulary: 
#             term term_count doc_count
#  1:         good          1         1
#  2: book_is_very          1         1
#  3:    This_book          1         1
#  4:         This          1         1
#  5:         book          1         1
#  6:    very_good          1         1
#  7:      is_very          1         1
#  8:      book_is          1         1
#  9: This_book_is          1         1
# 10:           is          1         1
# 11:         very          1         1
dtm = create_dtm(it, vocab_vectorizer(v))
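
The same row filter extends to a whole vector of stop phrases. In the output above the n-gram terms are joined with "_", so phrases written with spaces would not match until they are converted. The following is a minimal base-R sketch of that conversion. It assumes the "_" separator seen in the printed vocabulary, and the names stp_raw, terms and kept are hypothetical stand-ins rather than anything from text2vec.

```r
# Stop phrases as a person might type them in a CSV file (hypothetical data)
stp_raw <- c("book is very", "is very good")

# Assumption: n-grams in the vocabulary are joined with "_", as in the output above
stp <- gsub(" ", "_", stp_raw, fixed = TRUE)

# Stand-in for the term column of a vocabulary
terms <- c("book", "book_is_very", "very_good", "is_very_good")

# Keep only the terms that are not stop phrases
kept <- terms[!(terms %in% stp)]
print(kept)
```

With a real vocabulary v, the equivalent filter would be v[!(v$term %in% stp), ], following the pattern used in this answer.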

score 0 · Accepted Answer

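@Dmitriy Even this removes the attributes... so the workaround I found for now is to add the attributes back manually with the attr function:

attr(vocab_mod, "ngram") <- c(ngram_min = 1L, ngram_max = 3L)

and likewise for the other attributes. The attribute details can be taken from vocab.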
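
Copying each attribute by hand can be wrapped in a small helper. The sketch below uses only base R and a plain data.frame as a stand-in for the vocabulary, so it does not depend on any particular text2vec version. The helper name restore_attrs and the attribute values are hypothetical, and which attributes a real vocabulary carries should be checked with attributes(vocab).

```r
# Hypothetical helper: copy custom attributes from `original` onto `filtered`.
# names, row.names and class are skipped because they describe the object's
# own shape and must not be overwritten after rows have been removed.
restore_attrs <- function(filtered, original,
                          skip = c("names", "row.names", "class")) {
  for (nm in setdiff(names(attributes(original)), skip)) {
    attr(filtered, nm) <- attr(original, nm, exact = TRUE)
  }
  filtered
}

# Stand-in for a vocabulary: a data.frame carrying meta information in attributes
vocab <- data.frame(term = c("book", "book_is_very", "very_good"),
                    term_count = c(1L, 1L, 1L),
                    stringsAsFactors = FALSE)
attr(vocab, "ngram") <- c(ngram_min = 1L, ngram_max = 3L)

stp <- c("book_is_very")

# Filter the rows, then put the meta information back
vocab_mod <- subset(vocab, !(term %in% stp))
vocab_mod <- restore_attrs(vocab_mod, vocab)

print(attr(vocab_mod, "ngram"))
```

The helper restores whatever custom attributes the original object has, so it does not need to know their names in advance.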
