3

I wanted to create a distance proximity matrix for 10060 records/ points, where each record/point has 23 attributes using euclidean distance as metric. I wrote code using nested for loops to calculate distance between each point(leading to (n(n-1))/2) computations). It took a long time(about 8 minutes). When I used cdist it took so much lesser time(just 3 seconds !!!). When I looked at the source code, the cdist also uses nested for loops and moreover it makes n^2 computations(which is greater than the number of comparisons my logic does). What is making cdist execute faster and give correct output as well ? Please help me understand. Thanks in advance.

4

1 回答 1

6

Where did you read the source code? The python code calls (if you follow it all the way down in the default metric='euclidean' case) the c code

static NPY_INLINE int
cdist_seuclidean(const double *XA, const double *XB, const double *var,
                 double *dm, const npy_intp num_rowsA, const npy_intp num_rowsB,
                 const npy_intp num_cols)
{
    npy_intp i, j;

    for (i = 0; i < num_rowsA; ++i) {
        const double *u = XA + (num_cols * i);
        for (j = 0; j < num_rowsB; ++j, ++dm) {
            const double *v = XB + (num_cols * j);
            *dm = seuclidean_distance(var, u, v, num_cols);
        }
    }
    return 0;
}

where seuclidean_distance is

static NPY_INLINE double
seuclidean_distance(const double *var, const double *u, const double *v,
                    const npy_intp n)
{
    double s = 0.0;
    npy_intp i;

    for (i = 0; i < n; ++i) {
        const double d = u[i] - v[i];
        s += (d * d) / var[i];
    }
    return sqrt(s);
}

So it's actually a triple loop, but this is highly optimised C code. Python for loops are slow, they take up a lot of overhead and should never be used with numpy arrays because scipy/numpy can take advantage of the underlying memory data held within an ndarray object in ways that python can't.

于 2018-08-01T10:09:06.720 回答