我们正在寻找一种有效的算法来解决以下问题:
给定两个越来越排序的数组。在每个数组中查找差异低于用户给定阈值的最接近的对应元素。
array1[i] +/- threshold
但只应返回最接近的可能候选者(在 的范围内)。第二个最接近的可以匹配到另一个元素,但不允许匹配多个元素。如果两个元素 与第一个(最左边)匹配array1
的距离相同,array2[j]
则应报告。数组可以包含重复的值。应该报告第一个(最左边的)匹配项(并且所有其他匹配项都被忽略/不匹配)。
例子:
x: 1, 3, 5, 6, 8
y: 3, 4, 5, 7
threshold: 1
output: NA, 1, 3, 4, NA
(index of y that matches best to x)
x: 1, 1.5, 2, 2.1, 5, 6.1, 7.2
y: 4.6, 4.7, 4.8, 4.9, 5, 6, 7, 8
threshold: 3
output: NA, NA, NA, 1, 5, 6, 7
(index of y that matches best to x)
x: 1, 1, 1, 2, 2, 2
y: 1, 2
threshold: 0
output: 1, NA, NA, 2, NA, NA
(index of y that matches best to x, for duplicates choose to first one)
x: 1, 2
y: 1, 1, 1, 2, 2, 2
threshold: 0
output: 1, 4
(index of y that matches best to x, for duplicates choose to first one)
在比较质谱时,我们使用它来找到两个m/z值(质荷比)之间最接近的匹配值。
目前,我们遍历两个数组并预测下两个元素的差异,如果找到更接近的元素,则更正前一个元素。但这对于连续两个以上的重复元素失败(第二个示例):
我们当前的实现(作为 R 包一部分的 C 代码): https ://github.com/rformassspectrometry/MsCoreUtils/blob/master/src/closest.c#L73-L129
下面的评论版本:
SEXP C_closest_dup_closest(SEXP x, SEXP table, SEXP tolerance, SEXP nomatch) {
/* x is the first array of doubles */
double *px = REAL(x);
const unsigned int nx = LENGTH(x);
/* table is the second array of doubles where x should be matched against */
double *ptable = REAL(table);
const unsigned int ntable = LENGTH(table);
/* user given tolerance threshold */
double *ptolerance = REAL(tolerance);
/* integer array to store the results */
SEXP out = PROTECT(allocVector(INTSXP, nx));
int* pout = INTEGER(out);
/* integer that should returned if no valid match or a closer one was found */
const unsigned int inomatch = asInteger(nomatch);
/* indices */
unsigned int ix = 0, ixlastused = 1;
unsigned int itbl = 0, itbllastused = 1;
/* differences: current, difference to next element of x and table, respectively */
double diff = R_PosInf, diffnxtx = R_PosInf, diffnxttbl = R_PosInf;
while (ix < nx) {
if (itbl < ntable) {
/* difference for current pair */
diff = fabs(px[ix] - ptable[itbl]);
/* difference for next pairs */
diffnxtx =
ix + 1 < nx ? fabs(px[ix + 1] - ptable[itbl]) : R_PosInf;
diffnxttbl =
itbl + 1 < ntable ? fabs(px[ix] - ptable[itbl + 1]) : R_PosInf;
if (diff <= ptolerance[ix]) {
/* valid match, add + 1 to convert between R/C index */
pout[ix] = itbl + 1;
if (itbl == itbllastused &&
(diffnxtx < diffnxttbl || diff < diffnxttbl))
pout[ixlastused] = inomatch;
ixlastused = ix;
itbllastused = itbl;
} else
pout[ix] = inomatch;
if (diffnxtx < diff || diffnxttbl < diff) {
/* increment the index with the smaller distance */
if (diffnxtx < diffnxttbl)
++ix;
else
++itbl;
} else {
/* neither next x nor next table item offer a better match */
++ix;
++itbl;
}
} else
pout[ix++] = inomatch;
}
/* R provided MACRO to free allocated memory */
UNPROTECT(1);
return out;
}
有人可以给我们一个更好算法的提示吗?