I wanted to create a distance proximity matrix for 10060 records/ points, where each record/point has 23 attributes using euclidean distance as metric. I wrote code using nested for loops to calculate distance between each point(leading to (n(n-1))/2) computations). It took a long time(about 8 minutes). When I used cdist it took so much lesser time(just 3 seconds !!!). When I looked at the source code, the cdist also uses nested for loops and moreover it makes n^2 computations(which is greater than the number of comparisons my logic does). What is making cdist execute faster and give correct output as well ? Please help me understand. Thanks in advance.
问问题
3997 次
1 回答
6
Where did you read the source code? The python code calls (if you follow it all the way down in the default metric='euclidean'
case) the c code
static NPY_INLINE int
cdist_seuclidean(const double *XA, const double *XB, const double *var,
double *dm, const npy_intp num_rowsA, const npy_intp num_rowsB,
const npy_intp num_cols)
{
npy_intp i, j;
for (i = 0; i < num_rowsA; ++i) {
const double *u = XA + (num_cols * i);
for (j = 0; j < num_rowsB; ++j, ++dm) {
const double *v = XB + (num_cols * j);
*dm = seuclidean_distance(var, u, v, num_cols);
}
}
return 0;
}
where seuclidean_distance
is
static NPY_INLINE double
seuclidean_distance(const double *var, const double *u, const double *v,
const npy_intp n)
{
double s = 0.0;
npy_intp i;
for (i = 0; i < n; ++i) {
const double d = u[i] - v[i];
s += (d * d) / var[i];
}
return sqrt(s);
}
So it's actually a triple loop, but this is highly optimised C code. Python for
loops are slow, they take up a lot of overhead and should never be used with numpy arrays because scipy/numpy can take advantage of the underlying memory data held within an ndarray
object in ways that python can't.
于 2018-08-01T10:09:06.720 回答