
I used text2vec to generate custom word embeddings from a corpus of proprietary text data that contains a lot of industry-specific jargon (so stock embeddings like those available from Google won't work). The analogies work great, but I'm having difficulty applying the embeddings to assess new data: I want to use the embeddings I've already trained to understand relationships in new documents. The approach I'm using (described below) seems convoluted, and it's painfully slow. Is there a better approach? Perhaps something already built into the package that I've simply missed?

Here's my approach (offered with the closest thing to reproducible code I can generate given that I'm using a proprietary data source):

d = a list containing the new data; each element is of class character

vecs = the word embeddings obtained from text2vec's implementation of GloVe
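
For concreteness, here are hypothetical stand-ins for those two objects (toy values only, since I can't share the real data):

  library(text2vec)
  library(magrittr) # for the pipe used below
  # d: a couple of made-up "documents"
  d <- list("engine torque exceeded spec",
            "replace the torque converter seal")
  # vecs: one row per vocabulary term, one column per embedding dimension
  vecs <- matrix(rnorm(6 * 5), nrow = 6,
                 dimnames = list(c("engine", "torque", "exceeded",
                                   "replace", "converter", "seal"), NULL))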

  new_vecs <- sapply(d, function(y) {
                    it <- itoken(word_tokenizer(y), progressbar = FALSE) # for each document, create an iterator
                    voc <- create_vocabulary(it, stopwords = tm::stopwords()) # for each document, create a vocabulary
                    vecs[rownames(vecs) %in% voc$vocab$terms, , drop = FALSE] %>% # subset vecs for the words in the new document, then
                    colMeans # find the average vector for that document
                    }) %>% t # close the function and sapply, then transpose to return a matrix with one row per document

For my use case, I need to keep the results separate for each document, so anything that involves pasting together the elements of d won't work. Surely there must be a better way than what I've cobbled together; I feel like I must be missing something rather obvious.

Any help will be greatly appreciated.


1 Answer


You need to do this in "batch" mode using efficient linear-algebra matrix operations. The idea is to build a document-term matrix for the documents in d. That matrix records how many times each word appears in each document. Then you just need to multiply the dtm by the embeddings matrix:

library(text2vec)
# we are only interested in words for which we have embeddings
voc = create_vocabulary(rownames(vecs))
# now create the document-term matrix
vectorizer = vocab_vectorizer(voc)
dtm = itoken(d, tokenizer = word_tokenizer) %>% 
  create_dtm(vectorizer)

# normalize - calculate term frequency, i.e. divide the count of each word 
# in a document by the total number of words in that document.
# This way we end up with the average of the word vectors (not the sum!)
dtm = normalize(dtm, "l1")
# now we can calculate the document vectors (the average of the word vectors)
# as the dot product of the dtm and the embeddings matrix
document_vecs = dtm %*% vecs
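
As a quick follow-up check (a minimal sketch, assuming the document_vecs and vecs objects from above), you can use text2vec's sim2() to compare the resulting document vectors with each other, or with individual word vectors:

# cosine similarity between every pair of new documents
doc_sim = sim2(as.matrix(document_vecs), method = "cosine", norm = "l2")
# cosine similarity between each new document and each word embedding
doc_word_sim = sim2(as.matrix(document_vecs), vecs, method = "cosine", norm = "l2")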