
The code below (to compute cosine similarity), when run repeatedly on my computer, will output 1.0, 0.9999999999999998, or 1.0000000000000002. When I take out the normalize function, it will only return 1.0. I thought floating point operations were supposed to be deterministic. What would be causing this in my program if the same operations are being applied on the same data on the same computer each time? Is it maybe something to do with where on the stack the normalize function is being called? How can I prevent this?

#! /usr/bin/env python3

import math

def normalize(vector):
    sum = 0
    for key in vector.keys():
        sum += vector[key]**2
    sum = math.sqrt(sum)
    for key in vector.keys():
        vector[key] = vector[key]/sum
    return vector

dict1 = normalize({"a":3, "b":4, "c":42})
dict2 = dict1

n_grams = list(list(dict1.keys()) + list(dict2.keys()))
numerator = 0
denom1 = 0
denom2 = 0

for n_gram in n_grams:
    numerator += dict1[n_gram] * dict2[n_gram]
    denom1 += dict1[n_gram]**2
    denom2 += dict2[n_gram]**2

print(numerator/(math.sqrt(denom1)*math.sqrt(denom2)))

1 Answer


Floating-point math may be deterministic, but the ordering of dictionary keys is not.

When you call .keys(), the order of the keys it returns can vary from one run to the next.

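As a result, the order of the math operations inside your loops can vary too, and so the result is not deterministic: each individual floating-point operation is deterministic, but the result of a series of operations depends very much on the order in which they are performed.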
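To illustrate why order matters, here is a minimal sketch (not taken from the question's code) showing that floating-point addition is not associative. The values 0.1, 0.2 and 0.3 are chosen purely for illustration:

```python
# Adding the same three numbers with different groupings.
# Each addition rounds its result, so the grouping changes
# which intermediate values get rounded.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left == right)  # False
print(left, right)
```

The same effect applies to the running totals in the question's loop: visiting the keys in a different order adds the same terms in a different order, which can shift the last bit of the result.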

You can enforce a consistent order by sorting the list of keys.
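A rough sketch of what that could look like is below. It is an assumption-laden rewrite rather than the asker's exact program: `cosine_similarity` is a hypothetical helper name, `normalize` returns a new dict instead of modifying its argument, and each key is visited once instead of twice. Note that on Python 3.7 and later, dicts preserve insertion order, so the run-to-run variation is mainly seen on older versions; sorting makes the order independent of how the dict was built either way.

```python
import math

def normalize(vector):
    """Return a new dict scaled to unit length, visiting keys in sorted order."""
    keys = sorted(vector)
    norm = math.sqrt(sum(vector[k] ** 2 for k in keys))
    return {k: vector[k] / norm for k in keys}

def cosine_similarity(v1, v2):
    """Hypothetical helper: cosine similarity with a fixed (sorted) key order."""
    keys = sorted(set(v1) | set(v2))
    numerator = sum(v1.get(k, 0.0) * v2.get(k, 0.0) for k in keys)
    denom1 = sum(v1.get(k, 0.0) ** 2 for k in keys)
    denom2 = sum(v2.get(k, 0.0) ** 2 for k in keys)
    return numerator / (math.sqrt(denom1) * math.sqrt(denom2))

vec = normalize({"a": 3, "b": 4, "c": 42})
print(cosine_similarity(vec, vec))
```

Sorting does not guarantee that the output is exactly 1.0; it only guarantees that the same operations run in the same order, so the output is the same on every run. If you need to treat the result as 1.0, compare with a tolerance, for example with math.isclose.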

Answered 2014-02-08T06:27:44.183