python - Embedding and clustering a specific text (using GloVe)

Question

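Edit 2: I've thought about my question more carefully and realized it was far too general. It really comes down to one basic problem:

From the GloVe file (glove.6B.300d.txt), create a new array that contains only the words I have in my document.

I know this actually has nothing to do with this specific GloVe file, and that I should learn how to do it for any two lists of words...

I assume I just don't know how to search for this properly in order to learn how to do this part, i.e. which library, functions, or buzzwords I should be looking for.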
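One way to approach the filtering step described above is a single pass over the file that keeps a row only when its first token is in a set of wanted words. This is a minimal sketch, not a definitive implementation: the function name `filter_glove` and the three-dimensional sample vectors are made up for illustration, and it assumes the usual GloVe text layout of one word followed by space-separated floats per line.

```python
import numpy as np


def filter_glove(lines, wanted_words):
    """Keep only the rows whose word is in wanted_words.

    lines: any iterable of text lines in "word v1 v2 ... vn" format,
           for example an open file handle.
    Returns (matrix, labels); words that never appear in `lines`
    are simply absent from the result.
    """
    wanted = set(wanted_words)  # set lookup is constant time per word
    labels, rows = [], []
    for line in lines:
        parts = line.rstrip().split(" ")
        word = parts[0]
        if word in wanted:
            labels.append(word)
            rows.append([float(x) for x in parts[1:]])
    matrix = np.array(rows, dtype=float)
    return matrix, labels


# Tiny in-memory stand-in for the real file (hypothetical 3-d vectors).
sample_lines = [
    "the 0.1 0.2 0.3",
    "cat 0.4 0.5 0.6",
    "dog 0.7 0.8 0.9",
]
matrix, labels = filter_glove(sample_lines, ["dog", "cat", "unicorn"])
print(labels)        # ['cat', 'dog']  (file order, "unicorn" not found)
print(matrix.shape)  # (2, 3)
```

With the real file, the same function should work on a handle opened with `open(path, encoding="utf-8")`. The glove.6B vectors are lowercased, so the document words probably need to be lowercased before the lookup.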

Edit 1: I'm adding the code that works for the entire GloVe library:

from __future__ import division
from numbers import Number
import codecs

import numpy
from sklearn.cluster import KMeans


class autovivify_list(dict):
    """Dict that creates an empty list for any missing key."""

    def __missing__(self, key):
        value = self[key] = []
        return value

    def __add__(self, x):
        # An empty instance plus a number behaves like the number itself.
        if not self and isinstance(x, Number):
            return x
        raise ValueError

    def __sub__(self, x):
        # An empty instance minus a number behaves like the negated number.
        if not self and isinstance(x, Number):
            return -1 * x
        raise ValueError


def build_word_vector_matrix(vector_file, n_words):
    """Read the first n_words rows of a GloVe file into a matrix and a label list."""
    numpy_arrays = []
    labels_array = []
    with codecs.open(vector_file, 'r', 'utf-8') as f:
        for c, r in enumerate(f):
            sr = r.split()
            labels_array.append(sr[0])
            numpy_arrays.append(numpy.array([float(i) for i in sr[1:]]))

            # c is zero-based, so stop once n_words rows have been read.
            if c + 1 == n_words:
                break

    return numpy.array(numpy_arrays), labels_array


def find_word_clusters(labels_array, cluster_labels):
    """Map each cluster id to the list of words assigned to it."""
    cluster_to_words = autovivify_list()
    for c, i in enumerate(cluster_labels):
        cluster_to_words[i].append(labels_array[c])
    return cluster_to_words


if __name__ == "__main__":
    input_vector_file = '/Users/.../Documents/GloVe/glove.6B/glove.6B.300d.txt'
    n_words = 1000
    reduction_factor = 0.5  # clusters per word: 1000 words -> 500 clusters
    n_clusters = int(n_words * reduction_factor)
    df, labels_array = build_word_vector_matrix(input_vector_file, n_words)
    kmeans_model = KMeans(init='k-means++', n_clusters=n_clusters, n_init=10)
    kmeans_model.fit(df)

    cluster_labels = kmeans_model.labels_
    cluster_inertia = kmeans_model.inertia_
    cluster_to_words = find_word_clusters(labels_array, cluster_labels)

    for c in cluster_to_words:
        print(cluster_to_words[c])
        print("\n")

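Original question:

Suppose I have a specific text (say 500 words). I want to do the following:

Create embeddings for all the words in this text (i.e. a list of the GloVe vectors for just these 500 words)
Cluster them (*this part I know how to do)

How would I go about doing this?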
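Before any vectors can be looked up, the text has to be turned into a list of distinct words, and that list has to be compared against the embedding vocabulary. The sketch below shows one possible way to do both with only the standard library. The regex tokenizer, the helper name `unique_words`, and the small `glove_vocab` set are assumptions for illustration; a real pipeline might prefer a proper tokenizer.

```python
import re


def unique_words(text):
    """Lowercase the text and return its distinct word tokens in first-seen order."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(tokens))


text = "The cat saw the dog. The dog saw a unicorn!"
doc_words = unique_words(text)
print(doc_words)  # ['the', 'cat', 'saw', 'dog', 'a', 'unicorn']

# Hypothetical embedding vocabulary; in practice this would be the set of
# first tokens read from the vector file.
glove_vocab = {"the", "cat", "dog", "saw", "a"}

covered = [w for w in doc_words if w in glove_vocab]
missing = [w for w in doc_words if w not in glove_vocab]
print(covered)  # ['the', 'cat', 'saw', 'dog', 'a']
print(missing)  # ['unicorn']
```

Comparing two word lists this way is just set membership, so it applies to any pair of lists, not only to GloVe. Words that end up in `missing` have no vector and need to be skipped or handled separately before clustering.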

score 0 · Accepted Answer

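This is a fairly simple problem. From your description, I infer that you have 500 words and that you have the vectors for them available. I'd suggest heading to the scikit-learn library and applying one of its standard clustering methods to the task. I'd suggest starting with K-means. Use the following link to choose the right method in scikit-learn: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html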
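As a rough illustration of the K-means suggestion, the sketch below clusters a handful of word vectors with `sklearn.cluster.KMeans` and groups the words by cluster id. The four words and their two-dimensional vectors are invented so that the example runs without the GloVe file; real GloVe rows would simply replace `vectors`.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

# Hypothetical words with made-up 2-d vectors forming two obvious groups.
labels = ["cat", "dog", "car", "truck"]
vectors = np.array([
    [0.0, 0.1],
    [0.1, 0.0],
    [5.0, 5.1],
    [5.1, 5.0],
])

# random_state is fixed only to make repeated runs behave the same way.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Group words by the cluster id assigned to their vector.
clusters = defaultdict(list)
for word, cluster_id in zip(labels, kmeans.labels_):
    clusters[int(cluster_id)].append(word)

# Cluster ids are arbitrary, so sort the groups for a stable display.
groups = sorted(clusters.values())
print(groups)  # [['car', 'truck'], ['cat', 'dog']]
```

Note that `n_clusters` cannot exceed the number of rows passed to `fit`, so for a 500-word text it has to be chosen against the number of words that actually had a vector.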
