0

我们正在寻找一种有效的算法来解决以下问题:

给定两个越来越排序的数组。在每个数组中查找差异低于用户给定阈值的最接近的对应元素。array1[i] +/- threshold但只应返回最接近的可能候选者(在 的范围内)。第二个最接近的可以匹配到另一个元素,但不允许匹配多个元素。如果两个元素 与第一个(最左边)匹配array1的距离相同,array2[j]则应报告。

数组可以包含重复的值。应该报告第一个(最左边的)匹配项(并且所有其他匹配项都被忽略/不匹配)。

例子:

x: 1, 3, 5, 6, 8
y: 3, 4, 5, 7
threshold: 1
output: NA, 1, 3, 4, NA
(index of y that matches best to x)

x: 1, 1.5, 2, 2.1, 5, 6.1, 7.2
y: 4.6, 4.7, 4.8, 4.9, 5, 6, 7, 8
threshold: 3
output: NA, NA, NA, 1, 5, 6, 7
(index of y that matches best to x)

x: 1, 1, 1, 2, 2, 2
y: 1, 2
threshold: 0
output: 1, NA, NA, 2, NA, NA
(index of y that matches best to x, for duplicates choose to first one)

x: 1, 2
y: 1, 1, 1, 2, 2, 2
threshold: 0
output: 1, 4
(index of y that matches best to x, for duplicates choose to first one)

在比较质谱时,我们使用它来找到两个m/z值(质荷比)之间最接近的匹配值。

目前,我们遍历两个数组并预测下两个元素的差异,如果找到更接近的元素,则更正前一个元素。但这对于连续两个以上的重复元素失败(第二个示例):

我们当前的实现(作为 R 包一部分的 C 代码): https ://github.com/rformassspectrometry/MsCoreUtils/blob/master/src/closest.c#L73-L129

下面的评论版本:

SEXP C_closest_dup_closest(SEXP x, SEXP table, SEXP tolerance, SEXP nomatch) {
    /* x is the first array of doubles */
    double *px = REAL(x);
    const unsigned int nx = LENGTH(x);

    /* table is the second array of doubles where x should be matched against */
    double *ptable = REAL(table);
    const unsigned int ntable = LENGTH(table);

    /* user given tolerance threshold */
    double *ptolerance = REAL(tolerance);

    /* integer array to store the results */
    SEXP out = PROTECT(allocVector(INTSXP, nx));
    int* pout = INTEGER(out);

    /* integer that should returned if no valid match or a closer one was found */
    const unsigned int inomatch = asInteger(nomatch);

    /* indices */
    unsigned int ix = 0, ixlastused = 1;
    unsigned int itbl = 0, itbllastused = 1;
    /* differences: current, difference to next element of x and table, respectively */
    double diff = R_PosInf, diffnxtx = R_PosInf, diffnxttbl = R_PosInf;

    while (ix < nx) {
        if (itbl < ntable) {
            /* difference for current pair */
            diff = fabs(px[ix] - ptable[itbl]);
            /* difference for next pairs */
            diffnxtx =
                ix + 1 < nx ? fabs(px[ix + 1] - ptable[itbl]) : R_PosInf;
            diffnxttbl =
                itbl + 1 < ntable ? fabs(px[ix] - ptable[itbl + 1]) : R_PosInf;

            if (diff <= ptolerance[ix]) {
                /* valid match, add + 1 to convert between R/C index */
                pout[ix] = itbl + 1;
                if (itbl == itbllastused &&
                        (diffnxtx < diffnxttbl || diff < diffnxttbl))
                    pout[ixlastused] = inomatch;
                ixlastused = ix;
                itbllastused = itbl;
            } else
                pout[ix] = inomatch;

            if (diffnxtx < diff || diffnxttbl < diff) {
                /* increment the index with the smaller distance */
                if (diffnxtx < diffnxttbl)
                    ++ix;
                else
                    ++itbl;
            } else {
                /* neither next x nor next table item offer a better match */
                ++ix;
                ++itbl;
            }
        } else
            pout[ix++] = inomatch;
    }

    /* R provided MACRO to free allocated memory */
    UNPROTECT(1);
    return out;
}

有人可以给我们一个更好算法的提示吗?

4

0 回答 0