python - 模糊分组依据，对相似词进行分组

Question

之前在这里问过这个问题

但没有给出关于如何“分组”项目的明确答案。基于 difflib 的解决方案基本上是搜索，对于给定的项目，difflib 可以从列表中返回最相似的单词。但是这如何用于分组呢？

我想减少

['ape', 'appel', 'apple', 'peach', 'puppy']

到

['ape', 'appel', 'peach', 'puppy']

或者

['ape', 'apple', 'peach', 'puppy']

我尝试过的一个想法是，对于每个项目，遍历列表，如果 get_close_matches 返回多个匹配项，则使用它，如果不保持原样。这部分有效，但它可以建议 apple for appel，然后 appel for apple，这些词只会换位，什么都不会改变。

我将不胜感激任何指针，库名称等。

注意：同样在性能方面，我们有一个 300,000 个项目的列表，get_close_matches 似乎有点慢。有人知道那里有基于 C/++ 的解决方案吗？

谢谢，

注意：进一步调查显示 kmedoid 是正确的算法（以及层次聚类），因为 kmedoid 不需要“中心”，它使用/使用数据点本身作为中心（这些点被称为 medoids，因此得名）。在词组的情况下，中心点将是该组/集群的代表元素。

score 5 · Accepted Answer

您需要对组进行规范化。在每个组中，选择一个代表该组的单词或编码。然后按他们的代表对单词进行分组。

一些可能的方法：

选择第一个遇到的单词。
选择字典顺序的第一个单词。
推导出所有单词的模式。
选择一个唯一索引。
使用soundex作为模式。

但是，将单词分组可能很困难。如果 A 与 B 相似，B 与 C 相似，则 A 和 C 不一定彼此相似。如果 B 是代表，则 A 和 C 都可以包含在该组中。但如果 A 或 C 是代表，则不能包括另一个。

按照第一个选择（第一个遇到的单词）：

class Seeder:
    def __init__(self):
        self.seeds = set()
        self.cache = dict()

    def get_seed(self, word):
        LIMIT = 2
        seed = self.cache.get(word,None)
        if seed is not None:
            return seed
        for seed in self.seeds:
            if self.distance(seed, word) <= LIMIT:
                self.cache[word] = seed
                return seed
        self.seeds.add(word)
        self.cache[word] = word
        return word

    def distance(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

import itertools

def group_similar(words):
    seeder = Seeder()
    words = sorted(words, key=seeder.get_seed)
    groups = itertools.groupby(words, key=seeder.get_seed)
    return [list(v) for k,v in groups]

例子：

import pprint

print pprint.pprint(group_similar([
    'the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
    'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
    'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
    'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
    'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
    'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
    'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
    'people', 'into', 'year', 'your', 'good', 'some', 'could',
    'them', 'see', 'other', 'than', 'then', 'now', 'look',
    'only', 'come', 'its', 'over', 'think', 'also', 'back',
    'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
    'way', 'even', 'new', 'want', 'because', 'any', 'these',
    'give', 'day', 'most', 'us'
]), width=120)

输出：

[['after'],
 ['also'],
 ['and', 'a', 'in', 'on', 'as', 'at', 'an', 'one', 'all', 'can', 'no', 'want', 'any'],
 ['back'],
 ['because'],
 ['but', 'about', 'get', 'just'],
 ['first'],
 ['from'],
 ['good', 'look'],
 ['have', 'make', 'give'],
 ['his', 'her', 'if', 'him', 'its', 'how', 'us'],
 ['into'],
 ['know', 'new'],
 ['like', 'time', 'take'],
 ['most'],
 ['of', 'I', 'it', 'for', 'not', 'he', 'you', 'do', 'by', 'we', 'or', 'my', 'so', 'up', 'out', 'go', 'me', 'now'],
 ['only'],
 ['over', 'our', 'even'],
 ['people'],
 ['say', 'she', 'way', 'day'],
 ['some', 'see', 'come'],
 ['the', 'be', 'to', 'that', 'this', 'they', 'there', 'their', 'them', 'other', 'then', 'use', 'two', 'these'],
 ['think'],
 ['well'],
 ['what', 'who', 'when', 'than'],
 ['with', 'will', 'which'],
 ['work'],
 ['would', 'could'],
 ['year', 'your']]

score 3 · Accepted Answer

您必须在封闭匹配词中决定要使用哪些词。可能是从 get_close_matches 返回的列表中获取第一个元素，或者只是在该列表上使用随机函数并从封闭匹配中获取一个元素。

必须有某种规则，因为它..

In [19]: import difflib

In [20]: a = ['ape', 'appel', 'apple', 'peach', 'puppy']

In [21]: a = ['appel', 'apple', 'peach', 'puppy']

In [22]: b = difflib.get_close_matches('ape',a)

In [23]: b
Out[23]: ['apple', 'appel']

In [24]: import random

In [25]: c = random.choice(b)

In [26]: c
Out[26]: 'apple'

In [27]:

现在从初始列表中删除 c，就是这样...对于 c++，您可以使用Levenshtein_distance

score 3 · Accepted Answer

这是另一个使用 Affinity Propagation 算法的版本。

import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.cluster import AffinityPropagation
import itertools

words = np.array(
    ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
     'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
     'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
     'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
     'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
     'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
     'people', 'into', 'year', 'your', 'good', 'some', 'could',
     'them', 'see', 'other', 'than', 'then', 'now', 'look',
     'only', 'come', 'its', 'over', 'think', 'also', 'back',
     'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
     'way', 'even', 'new', 'want', 'because', 'any', 'these',
     'give', 'day', 'most', 'us'])

print "calculating distances..."

(dim,) = words.shape

f = lambda (x,y): -leven.distance(x,y)

res=np.fromiter(itertools.imap(f, itertools.product(words, words)), dtype=np.uint8)
A = np.reshape(res,(dim,dim))

af = AffinityPropagation().fit(A)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

unique_labels = set(labels)
for i in unique_labels:
    print words[labels==i]

距离必须转换为相似度，我通过取负数来做到这一点。输出是

['to' 'you' 'do' 'by' 'so' 'who' 'go' 'into' 'also' 'two']
['it' 'with' 'at' 'if' 'get' 'its' 'first']
['of' 'for' 'from' 'or' 'your' 'look' 'after' 'work']
['the' 'be' 'have' 'I' 'he' 'we' 'her' 'she' 'me' 'give']
['this' 'his' 'which' 'him']
['and' 'a' 'in' 'an' 'my' 'all' 'can' 'any']
['on' 'one' 'good' 'some' 'see' 'only' 'come' 'over']
['would' 'could']
['but' 'out' 'about' 'our' 'most']
['make' 'like' 'time' 'take' 'back']
['that' 'they' 'there' 'their' 'when' 'them' 'other' 'than' 'then' 'think'
 'even' 'these']
['not' 'no' 'know' 'now' 'how' 'new']
['will' 'people' 'year' 'well']
['say' 'what' 'way' 'want' 'day']
['because']
['as' 'up' 'just' 'use' 'us']

score 1 · Accepted Answer

另一种方法可能是使用矩阵分解，使用 SVD。首先我们创建单词距离矩阵，对于 100 个单词，这将是 100 x 100 矩阵，表示每个单词到所有其他单词的距离。然后，在这个矩阵上运行 SVD，得到的 u,s,v 中的 u 可以看作是每个集群的成员强度。

代码

import numpy as np
import scipy.linalg as lin
import Levenshtein as leven
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import itertools

words = np.array(
    ['the', 'be', 'to', 'of', 'and', 'a', 'in', 'that', 'have',
     'I', 'it', 'for', 'not', 'on', 'with', 'he', 'as', 'you',
     'do', 'at', 'this', 'but', 'his', 'by', 'from', 'they', 'we',
     'say', 'her', 'she', 'or', 'an', 'will', 'my', 'one', 'all',
     'would', 'there', 'their', 'what', 'so', 'up', 'out', 'if',
     'about', 'who', 'get', 'which', 'go', 'me', 'when', 'make',
     'can', 'like', 'time', 'no', 'just', 'him', 'know', 'take',
     'people', 'into', 'year', 'your', 'good', 'some', 'could',
     'them', 'see', 'other', 'than', 'then', 'now', 'look',
     'only', 'come', 'its', 'over', 'think', 'also', 'back',
     'after', 'use', 'two', 'how', 'our', 'work', 'first', 'well',
     'way', 'even', 'new', 'want', 'because', 'any', 'these',
     'give', 'day', 'most', 'us'])

print "calculating distances..."

(dim,) = words.shape

f = lambda (x,y): leven.distance(x,y)
res=np.fromiter(itertools.imap(f, itertools.product(words, words)),
                dtype=np.uint8)
A = np.reshape(res,(dim,dim))

print "svd..."

u,s,v = lin.svd(A, full_matrices=False)

print u.shape
print s.shape
print s
print v.shape

data = u[:,0:10]
k=KMeans(init='k-means++', k=25, n_init=10)
k.fit(data)
centroids = k.cluster_centers_
labels = k.labels_
print labels

for i in range(np.max(labels)):
    print words[labels==i]

def dist(x,y):   
    return np.sqrt(np.sum((x-y)**2, axis=1))

print "centroid points.."
for i,c in enumerate(centroids):
    idx = np.argmin(dist(c,data[labels==i]))
    print words[labels==i][idx]
    print words[labels==i]

plt.plot(centroids[:,0],centroids[:,1],'x')
plt.hold(True)
plt.plot(u[:,0], u[:,1], '.')
plt.show()

from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure()
ax = Axes3D(fig)
ax.plot(u[:,0], u[:,1], u[:,2],'.', zs=0,
        zdir='z', label='zs=0, zdir=z')
plt.show()

结果

any
['and' 'an' 'can' 'any']
do
['to' 'you' 'do' 'so' 'go' 'no' 'two' 'how']
when
['who' 'when' 'well']
my
['be' 'I' 'by' 'we' 'my' 'up' 'me' 'use']
your
['for' 'or' 'out' 'about' 'your' 'our']
its
['it' 'his' 'if' 'him' 'its']
could
['would' 'people' 'could']
this
['this' 'think' 'these']
she
['the' 'he' 'she' 'see']
back
['all' 'back' 'want']
one
['of' 'on' 'one' 'only' 'even' 'new']
just
['but' 'just' 'first' 'most']
come
['some' 'come']
that
['that' 'than']
way
['say' 'what' 'way' 'day']
like
['like' 'time' 'give']
in
['in' 'into']
get
['her' 'get' 'year']
because
['because']
will
['with' 'will' 'which']
over
['other' 'over' 'after']
as
['a' 'as' 'at' 'also' 'us']
them
['they' 'there' 'their' 'them' 'then']
good
['not' 'from' 'know' 'good' 'now' 'look' 'work']
have
['have' 'make' 'take']

为集群数量选择 k 很重要，例如，k=25 比 k=20 给出更好的结果。

该代码还通过选择其 u[..] 坐标最接近集群质心的词来为每个集群选择一个代表词。

score 0 · Accepted Answer

这是一种基于 medoids 的方法。首先安装 MlPy。在 Ubuntu 上

sudo apt-get install python-mlpy

然后

import numpy as np
import mlpy

class distance:    
    def compute(self, s1, s2):
        l1 = len(s1)
        l2 = len(s2)
        matrix = [range(zz,zz + l1 + 1) for zz in xrange(l2 + 1)]
        for zz in xrange(0,l2):
            for sz in xrange(0,l1):
                if s1[sz] == s2[zz]:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz])
                else:
                    matrix[zz+1][sz+1] = min(matrix[zz+1][sz] + 1, matrix[zz][sz+1] + 1, matrix[zz][sz] + 1)
        return matrix[l2][l1]

x =  np.array(['ape', 'appel', 'apple', 'peach', 'puppy'])

km = mlpy.Kmedoids(k=3, dist=distance())
medoids,clusters,a,b = km.compute(x)

print medoids
print clusters
print a

print x[medoids] 
for i,c in enumerate(x[medoids]):
    print "medoid", c
    print x[clusters[a==i]]

输出是

[4 3 1]
[0 2]
[2 2]
['puppy' 'peach' 'appel']
medoid puppy
[]
medoid peach
[]
medoid appel
['ape' 'apple']

更大的单词列表并使用 k=10

medoid he
['or' 'his' 'my' 'have' 'if' 'year' 'of' 'who' 'us' 'use' 'people' 'see'
 'make' 'be' 'up' 'we' 'the' 'one' 'her' 'by' 'it' 'him' 'she' 'me' 'over'
 'after' 'get' 'what' 'I']
medoid out
['just' 'only' 'your' 'you' 'could' 'our' 'most' 'first' 'would' 'but'
 'about']
medoid to
['from' 'go' 'its' 'do' 'into' 'so' 'for' 'also' 'no' 'two']
medoid now
['new' 'how' 'know' 'not']
medoid time
['like' 'take' 'come' 'some' 'give']
medoid because
[]
medoid an
['want' 'on' 'in' 'back' 'say' 'and' 'a' 'all' 'can' 'as' 'way' 'at' 'day'
 'any']
medoid look
['work' 'good']
medoid will
['with' 'well' 'which']
medoid then
['think' 'that' 'these' 'even' 'their' 'when' 'other' 'this' 'they' 'there'
 'than' 'them']

python - 模糊分组依据，对相似词进行分组

5 回答 5

Related

Reference