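Hi, I have a very specific, odd question about applying MDS with Python.
When creating the distance matrix of the original high-dimensional dataset (let's call it distanceHD), you can measure the distances between all data points using either Euclidean or Manhattan distance.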
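For concreteness, here is a minimal sketch of the two metrics applied to two of the points from my example (A and C), in plain Python. The helper names `manhattan` and `euclidean` are just for illustration and are not from any library.

```python
import math

def manhattan(p, q):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

A = [0, 0, 0, 0]
C = [0, 1, 2, 3]

print(manhattan(A, C))  # 6
print(euclidean(A, C))  # sqrt(14), about 3.742
```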
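Then, after performing MDS, say I reduce 70+ columns down to 2. Now I can create a new distance matrix. Let's call it distance2D; it again measures the distances between the data points, in either Manhattan or Euclidean.
Finally, I can take the difference between the two distance matrices (distanceHD and distance2D). This difference matrix tells me whether the distances between data points in the large dataset were preserved in the new dataset with fewer columns (after performing MDS). I can then compute the stress from that difference matrix using the stress function; the closer the number is to 0, the better the projection.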
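Written out, the stress I compute at the end of my example is the following, where $d^{HD}_{ij}$ and $d^{2D}_{ij}$ are just my notation for the entries of distanceHD and distance2D:

```latex
\[
\text{stress} =
\sqrt{\frac{\sum_{i,j}\left(d^{HD}_{ij} - d^{2D}_{ij}\right)^{2}}
           {\sum_{i,j}\left(d^{HD}_{ij}\right)^{2}}}
\]
```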
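My question: I was originally taught to use Manhattan distance for the distanceHD matrix and Euclidean distance for the distance2D matrix. But why? Why not use Manhattan on both? Or Euclidean on both? Or Euclidean on distanceHD and Manhattan on distance2D?
I guess there is also an overall question: when should I use either distance metric with the MDS algorithm?
Sorry for this long and possibly confusing post. I have an example shown below: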
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataHD = pd.DataFrame(
[[0,0,0,0],
[1,1,1,1],
[0,1,2,3],
[0,0,0,1]],
index=['A','B','C','D'],
columns=['1','2','3','4'])
dataHD
import sklearn.metrics.pairwise as smp
distHD = smp.manhattan_distances(dataHD) #L1 Distance Function
distHD = pd.DataFrame(distHD, columns=dataHD.index, index=dataHD.index)
distHD
import sklearn.manifold
# Here we're going to search for a local minimum of the stress
# dissimilarity='precomputed' means the input is already a distance matrix
# shift + tab will show parameters
# n_init: Number of times the SMACOF algorithm will be run with different initializations.
# The final result is the output of the run with the smallest final stress.
# max_iter: Maximum number of iterations of the SMACOF algorithm for a single run.
mds = sklearn.manifold.MDS(dissimilarity = 'precomputed', n_init=10, max_iter=1000)
# NOTE: you will get different numbers every time you run this, because you'll
# find different local minima
# The key takeaway here is that the distances between data points are preserved
data2D = mds.fit_transform(distHD)
# Recall: we're using new columns that summarize the distHD table, so pick new column names
data2D = pd.DataFrame(data2D, columns=['x', 'y'], index = dataHD.index)
data2D
## Plot the MDS 2D result
%matplotlib inline
ax = data2D.plot.scatter(x='x', y='y')
# Label each data point with its index label (A, B, C, D).
# Looping over rows avoids integer lookups like data2D.x[0], which are
# deprecated in recent pandas when the index holds strings.
for label, row in data2D.iterrows():
    ax.text(row['x'], row['y'], label)
dist2D = smp.euclidean_distances(data2D) # L2 distance function
dist2D = pd.DataFrame(dist2D, columns = data2D.index, index = data2D.index)
dist2D
## Stress function: sqrt(sum((distHD - dist2D)^2) / sum(distHD^2)); closer to 0 is better
np.sqrt(((distHD - dist2D) **2).sum().sum() / (distHD**2).sum().sum())
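To make the question concrete, here is a rough sketch that computes the stress for all four metric combinations on the same toy data. It is only an illustration under some assumptions of mine: the `stress` helper and the `results` dictionary are names I made up, `random_state=0` is fixed purely so the run is repeatable, and the numbers will depend on the local minimum that MDS happens to find.

```python
import numpy as np
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances

def stress(d_hd, d_2d):
    """Normalized stress: sqrt(sum((d_hd - d_2d)^2) / sum(d_hd^2))."""
    return np.sqrt(((d_hd - d_2d) ** 2).sum() / (d_hd ** 2).sum())

# Same four points (A, B, C, D) as in the example above.
X = np.array([[0, 0, 0, 0],
              [1, 1, 1, 1],
              [0, 1, 2, 3],
              [0, 0, 0, 1]], dtype=float)

results = {}
for hd_metric in ("manhattan", "euclidean"):
    # Distance matrix in the original 4-dimensional space.
    d_hd = pairwise_distances(X, metric=hd_metric)

    # random_state is an assumption, fixed only for repeatability.
    mds = MDS(n_components=2, dissimilarity="precomputed",
              n_init=10, max_iter=1000, random_state=0)
    embedding = mds.fit_transform(d_hd)

    for ld_metric in ("manhattan", "euclidean"):
        # Distance matrix in the 2D embedding, measured with each metric.
        d_2d = pairwise_distances(embedding, metric=ld_metric)
        results[(hd_metric, ld_metric)] = stress(d_hd, d_2d)

for (hd_metric, ld_metric), value in results.items():
    print(f"HD={hd_metric:<9} 2D={ld_metric:<9} stress={value:.4f}")
```

Each row of the printout is one (distanceHD metric, distance2D metric) pairing, so the four cases I am asking about can be compared side by side.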