python - nltk k-means clustering or k-means with pure python

Question

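Although I have seen many questions related to this, I have not really found an answer, probably because I am new to clustering with nltk. I really need a basic, beginner-level explanation of clustering, especially of the vector representation used by NLTK's k-means clustering and how to use it. I have a list of words such as [cat, dog, kitten, puppy etc] and two other lists of words such as [carnivore, herbivore, pet, etc] and [mammal, domestic etc]. I want to cluster the last two word lists based on the first one, using the first list as the means or centroids. I tried this and got an AssertionError like this: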

clusterer = cluster.KMeansClusterer(2, euclidean_distance, initial_means=means)
  File "C:\Python27\lib\site-packages\nltk\cluster\kmeans.py", line 64, in __init__
    assert not initial_means or len(initial_means) == num_means

AND
    print clusterer.cluster(vectors, True)
  File "C:\Python27\lib\site-packages\nltk\cluster\util.py", line 55, in cluster
    self.cluster_vectorspace(vectors, trace)
  File "C:\Python27\lib\site-packages\nltk\cluster\kmeans.py", line 82, in cluster_vectorspace
    self._cluster_vectorspace(vectors, trace)
  File "C:\Python27\lib\site-packages\nltk\cluster\kmeans.py", line 113, in _cluster_vectorspace
    index = self.classify_vectorspace(vector)
  File "C:\Python27\lib\site-packages\nltk\cluster\kmeans.py", line 137, in classify_vectorspace
    dist = self._distance(vector, mean)
  File "C:\Python27\lib\site-packages\nltk\cluster\util.py", line 118, in euclidean_distance
    diff = u - v
TypeError: unsupported operand type(s) for -: 'numpy.ndarray' and 'numpy.ndarray'

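I think I am missing something in the vector representation. A basic example of vector representation with sample code would be highly appreciated. Any solution using nltk or pure Python is welcome. Thanks in advance for your kind response.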

score 2 · Accepted Answer

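If I understand your question correctly, something like this should work. The hard part of k-means is finding the cluster centers. If you have already found the centers, or know which centers you want, you can compute the distance from each point to every cluster center and assign the point to the nearest one.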
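To make the assignment step concrete, here is a minimal pure-Python sketch for numeric points. The 2-D data and the helper names `euclidean` and `assign_to_nearest` are made up for illustration; they are not part of nltk or of the question's code.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def assign_to_nearest(centers, points):
    """Return, for each point, the index of the closest center."""
    labels = []
    for p in points:
        distances = [euclidean(p, c) for c in centers]
        labels.append(distances.index(min(distances)))
    return labels

# Hypothetical 2-D data, chosen only for illustration.
centers = [(0.0, 0.0), (10.0, 10.0)]
points = [(1.0, 2.0), (9.0, 8.0), (0.5, 0.5), (7.0, 11.0)]
labels = assign_to_nearest(centers, points)
print(labels)  # [0, 1, 0, 1]
```

Because the centers are fixed, no iteration is needed: one pass over the points is enough.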

(As a side note, sklearn is a great package for clustering and machine learning.)

For your example, it should look something like this:

Levenshtein

# The levenshtein function is not my implementation; I copied it from the
# link above.
def levenshtein(s1, s2):
    if len(s1) < len(s2):
        return levenshtein(s2, s1)

    # len(s1) >= len(s2)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)  # range works on Python 2 and 3; xrange is Python 2 only
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
            deletions = current_row[j] + 1       # than s2
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

def get_closest_lev(cluster_center_words, my_word):
    closest_center = None
    smallest_distance = float('inf')
    for word in cluster_center_words:
        ld = levenshtein(word, my_word)
        if ld < smallest_distance:
            smallest_distance = ld
            closest_center = word
    return closest_center

def get_clusters(cluster_center_words, other_words):
    cluster_dict = {}
    for word in cluster_center_words:
        cluster_dict[word] = []
    for my_word in other_words:
        closest_center = get_closest_lev(cluster_center_words, my_word)
        cluster_dict[closest_center].append(my_word)
    return cluster_dict

Example:

cluster_center_words = ['dog', 'cat']
other_words = ['dogg', 'kat', 'frog', 'car']

Result:

>>> get_clusters(cluster_center_words, other_words)
{'dog': ['dogg', 'frog'], 'cat': ['kat', 'car']}
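On the vector representation part of the question: the assert in the first traceback requires `len(initial_means) == num_means`, so with `num_means=2` you need exactly two initial means. The TypeError at `diff = u - v` is consistent with the vectors holding strings rather than numbers, though I can't confirm that without seeing how `vectors` was built. Below is a rough pure-Python sketch of one way to go from words to numeric vectors and run k-means from given starting means. The letter-count encoding and the helper names (`word_to_vector`, `kmeans`, and so on) are my own assumptions for illustration, not anything nltk provides.

```python
import math
import string

ALPHABET = string.ascii_lowercase

def word_to_vector(word):
    """Represent a word as 26 letter counts (one illustrative choice)."""
    word = word.lower()
    return [float(word.count(ch)) for ch in ALPHABET]

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest(means, vector):
    """Index of the mean closest to the given vector."""
    distances = [euclidean(vector, m) for m in means]
    return distances.index(min(distances))

def mean_vector(vectors):
    """Component-wise average of a non-empty list of vectors."""
    n = float(len(vectors))
    return [sum(column) / n for column in zip(*vectors)]

def kmeans(vectors, initial_means, max_iter=20):
    """Alternate assignment and mean updates until labels stop changing."""
    means = [list(m) for m in initial_means]
    labels = []
    for _ in range(max_iter):
        new_labels = [nearest(means, v) for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(means)):
            members = [v for v, lab in zip(vectors, labels) if lab == k]
            if members:  # keep the old mean if a cluster is empty
                means[k] = mean_vector(members)
    return labels, means

center_words = ['dog', 'cat']
other_words = ['dogg', 'kat', 'frog', 'car']
labels, means = kmeans([word_to_vector(w) for w in other_words],
                       [word_to_vector(w) for w in center_words])
clusters = {}
for word, lab in zip(other_words, labels):
    clusters.setdefault(lab, []).append(word)
print(clusters)  # {0: ['dogg', 'frog'], 1: ['kat', 'car']}
```

Keep in mind that letter counts only capture spelling, just like the Levenshtein approach above. Grouping words such as carnivore or pet by meaning would need features that encode meaning, which this sketch does not attempt.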
