r - Implementing Arora 2017 in text2vec

Question

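I am trying to replicate Arora 2017 ( https://github.com/PrincetonML/SIF / https://openreview.net/forum?id=SyK00v5xx ) using text2vec. The authors compute sentence embeddings by averaging word embeddings and subtracting the first principal component.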

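Thanks to the author of text2vec, I can compute the GloVe embeddings and average them. The next step is to compute the principal components / SVD and subtract the first component from the embeddings.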

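I can compute the SVD using the irlba package (which I believe is also used in text2vec), but then I am stuck on how exactly to subtract the principal component from the averaged word embeddings.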

The Python code from the paper (https://github.com/PrincetonML/SIF/blob/master/src/SIF_embedding.py) has the following function:

def remove_pc(X, npc=1):
    """
    Remove the projection on the principal components
    :param X: X[i,:] is a data point
    :param npc: number of principal components to remove
    :return: XX[i, :] is the data point after removing its projection
    """
    pc = compute_pc(X, npc)
    if npc==1:
        XX = X - X.dot(pc.transpose()) * pc
    else:
        XX = X - X.dot(pc.transpose()).dot(pc)
    return XX

My R code is:

# get the word vectors
wv_context = glove$components
word_vectors = wv_main + t(wv_context)

# create document term matrix
dtm = create_dtm(it, vectorizer)

# assign the word embeddings
common_terms = intersect(colnames(dtm), rownames(word_vectors) )

# normalise
dtm_averaged <-  text2vec::normalize(dtm[, common_terms], "l1")

For example, if I have 1K sentences x 300 variables, running the irlba function gives me three matrices. These have, for instance, 4 components x 1K observations.

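How do I transform the output of this function (1K x variables/components) so that I can subtract the component from the sentence embeddings (1K x 300 variables)?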

Thanks!

score 0 · Accepted Answer

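The idea is that with a truncated SVD/PCA you can reconstruct the original matrix with minimal squared error. You get an SVD of the form (U, D, V), and the reconstruction of the original matrix is A ~ U * D * t(V). Keeping only the first singular triplet gives the rank-1 part of the matrix, i.e. the part explained by the first principal component. Now we subtract this reconstruction from the original matrix, which is exactly what the authors propose. Here is an example: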

library(text2vec)
data("movie_review")

it = itoken(movie_review$review, preprocessor = tolower, tokenizer = word_tokenizer)
dtm = create_dtm(it, hash_vectorizer(2**14))

lsa = LSA$new(n_topics = 64)
doc_emb = lsa$fit_transform(dtm)

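# SVD of the document embeddings; keep only the first left/right singular vectors
doc_emb_svd = svd(doc_emb, nu = 1, nv = 1)
# rank-1 reconstruction from the first singular triplet (d[1] is the largest singular value)
doc_emb_pc1 = doc_emb_svd$u %*% doc_emb_svd$d[1] %*% t(doc_emb_svd$v)
doc_emb_minus_pc1 = doc_emb - doc_emb_pc1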
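
To connect this with the Python remove_pc() from the question, here is a minimal base-R sketch of the same projection removal. It assumes that compute_pc() (not shown in the question) returns the top right singular vectors of X without centering; the function name remove_pc and the random test matrix are illustrative only and are not part of text2vec.

```r
# Sketch of an R counterpart to the Python remove_pc(); illustrative, not text2vec API.
# X:   numeric matrix with one sentence embedding per row (e.g. 1000 x 300)
# npc: number of principal directions to remove
remove_pc <- function(X, npc = 1) {
  # Only the first npc right singular vectors are needed.
  # Assumption: X is not centered first.
  s <- svd(X, nu = 0, nv = npc)
  pc <- s$v                       # ncol(X) x npc
  # Project every row onto those directions and subtract the projection.
  X - X %*% pc %*% t(pc)
}

set.seed(1)
X <- matrix(rnorm(20 * 5), nrow = 20, ncol = 5)   # toy "embeddings"
X_clean <- remove_pc(X, npc = 1)

# For npc = 1 this matches subtracting the rank-1 reconstruction u * d[1] * t(v).
s <- svd(X, nu = 1, nv = 1)
X_rank1 <- s$d[1] * (s$u %*% t(s$v))
all.equal(X_clean, X - X_rank1)
```

The result has the same dimensions as the input, so for 1K sentences x 300 variables you get a 1K x 300 matrix back. For large matrices, irlba::irlba(X, nv = npc)$v could probably stand in for the svd() call, since only the right singular vectors (300 x npc here) are used.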

If you get a chance to finish your implementation, please consider contributing it to text2vec. Here is the ticket for Arora sentence embeddings: https://github.com/dselivanov/text2vec/issues/157.
