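You probably need standardization, but that requires more data than just your two input vectors. Weighting is applied when you want to give one of your features (which I think these two are) more or less importance than the other.
As an example, I artificially took the entire range of each feature (at integer steps) as the population for standardization, and compared your single example against normalization and against no procedure (i.e., nothing done to the data). Here are the results: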
(standardization) Similarity: 0.744599 Data: (-1.12599, 0.88339), (-0.259844, 1.47232).
( normalization) Similarity: 0.978736 Data: (0.166667, 0.75), (0.416667, 0.92).
( none) Similarity: 0.997788 Data: (20, 175), (35, 192).
To me, at least, the result using standardization makes more sense.
Here is the basic example code that generated the results above:
import numpy
def cosine_dist(a, b):  # Cosine similarity between a and b (despite the name).
    return sum(a * b) / ((sum(a ** 2) * sum(b ** 2)) ** 0.5)
age_range = [10., 70.]
height_range = [100., 200.]
# Input.
age = numpy.array([20, 35])
height = numpy.array([175, 192])
# Normalization
age_n = numpy.array(age, dtype=float)
height_n = numpy.array(height, dtype=float)
age_n = (age_n - age_range[0]) / (age_range[1] - age_range[0])
height_n = (height_n - height_range[0]) / (height_range[1] - height_range[0])
# Standardization.
all_age = numpy.array(range(*map(int, age_range)))
all_height = numpy.array(range(*map(int, height_range)))
age_s = numpy.array(age, dtype=float)
height_s = numpy.array(height, dtype=float)
age_s = (age_s - all_age.mean()) / all_age.std()
height_s = (height_s - all_height.mean()) / all_height.std()
for name, a, h in [('standardization', age_s, height_s),
                   ('normalization', age_n, height_n), ('none', age, height)]:
    # One row per person: (age, height).
    data = numpy.array([(a[0], h[0]), (a[1], h[1])])
    data_s = '(%g, %g), (%g, %g)' % (data[0][0], data[0][1], data[1][0], data[1][1])
    print("(%15s) Similarity: %g\t\tData: %s." % (name, cosine_dist(*data),
                                                   data_s))
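Weighting is mentioned at the top but not shown, so here is a minimal sketch of one common way to do it: multiply each feature's contribution by a non-negative weight inside the cosine formula. The function name `weighted_cosine_similarity` and the weight values are hypothetical, picked only for illustration; the inputs are the standardized (age, height) pairs from the results listed earlier.

```python
import numpy


def weighted_cosine_similarity(a, b, w):
    """Cosine similarity with one non-negative weight per feature.

    Equivalent to scaling every feature by sqrt(w) and then taking the
    ordinary cosine similarity.
    """
    a, b, w = (numpy.asarray(x, dtype=float) for x in (a, b, w))
    numerator = numpy.sum(w * a * b)
    denominator = numpy.sqrt(numpy.sum(w * a ** 2) * numpy.sum(w * b ** 2))
    return numerator / denominator


# Standardized (age, height) pairs copied from the results listed earlier.
person_1 = [-1.12599, 0.88339]
person_2 = [-0.259844, 1.47232]

# Hypothetical weights, chosen only for illustration.
equal = weighted_cosine_similarity(person_1, person_2, [1.0, 1.0])
height_heavy = weighted_cosine_similarity(person_1, person_2, [1.0, 4.0])

print("equal weights:      %g" % equal)
print("height weighted x4: %g" % height_heavy)
```

With equal weights this reduces to the plain cosine similarity. Here both people are above the mean height but differ more in age, so putting more weight on height pushes the score up; which weights are appropriate depends on what you want "similar" to mean.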