我正在做一些行为分析,我会随着时间的推移跟踪行为,然后创建这些行为的 n-gram。
sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
['scratch', 'scratch', 'scratch', 'sit', 'stand']]
我希望能够对这些 n-gram 进行聚类,但我需要使用自定义指标创建一个预先计算的距离矩阵。我的指标似乎工作正常,但是当我尝试使用 sklearn 函数创建距离矩阵时,出现错误:
ValueError: could not convert string to float: 'scratch'
我查看了文档https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html,关于这个主题并不是特别清楚。
任何熟悉如何正确使用它的人?
完整代码如下:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import matplotlib.mlab as mlab
import math
import hashlib
import networkx as nx
import itertools
import hdbscan
from sklearn.metrics.pairwise import pairwise_distances
def get_levenshtein_distance(path1, path2):
"""
https://en.wikipedia.org/wiki/Levenshtein_distance
:param path1:
:param path2:
:return:
"""
matrix = [[0 for x in range(len(path2) + 1)] for x in range(len(path1) + 1)]
for x in range(len(path1) + 1):
matrix[x][0] = x
for y in range(len(path2) + 1):
matrix[0][y] = y
for x in range(1, len(path1) + 1):
for y in range(1, len(path2) + 1):
if path1[x - 1] == path2[y - 1]:
matrix[x][y] = min(
matrix[x - 1][y] + 1,
matrix[x - 1][y - 1],
matrix[x][y - 1] + 1
)
else:
matrix[x][y] = min(
matrix[x - 1][y] + 1,
matrix[x - 1][y - 1] + 1,
matrix[x][y - 1] + 1
)
return matrix[len(path1)][len(path2)]
sample_n_gram_list = [['scratch', 'scratch', 'scratch', 'scratch', 'scratch'],
['scratch', 'scratch', 'scratch', 'scratch', 'smell/sniff'],
['scratch', 'scratch', 'scratch', 'sit', 'stand']]
print("should be 0")
print(get_levenshtein_distance(sample_n_gram_list[1],sample_n_gram_list[1]))
print("should be 1")
print(get_levenshtein_distance(sample_n_gram_list[1],sample_n_gram_list[0]))
print("should be 2")
print(get_levenshtein_distance(sample_n_gram_list[0],sample_n_gram_list[2]))
clust_number = 2
distance_matrix = pairwise_distances(sample_n_gram_list, metric=get_levenshtein_distance)
clusterer = hdbscan.HDBSCAN(metric='precomputed')
clusterer.fit(distance_matrix)
clusterer.labels_