algorithm - Is there some math method to represent this kind of data?

Question

Suppose we have a sorted array whose values are clustered to some degree, such as

1 2 3 7 8 12 13 16 20 21 22 23

How can we represent this kind of data mathematically? Suppose we have another array like this

1 2 10 11 20 21

The intersection of these two arrays is

1 2 20 21

Note that we already have a fully parallel algorithm that computes the intersection of two arrays of this kind, and we want to analyze its cost mathematically. The algorithm binary-searches every element of the shorter array in the longer one.
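For reference, the search described above can be sketched sequentially as follows. This is only an illustration, not the asker's GPU code: the function name `intersect_by_binary_search` is made up here, and both inputs are assumed to be sorted and free of duplicates.

```python
from bisect import bisect_left

def intersect_by_binary_search(shorter, longer):
    """Return the values of `shorter` that also occur in `longer`.

    Assumes both lists are sorted and contain unique values.
    """
    result = []
    # Each iteration is independent of the others, which is what would let
    # a parallel version assign one element of `shorter` to each thread.
    for x in shorter:
        pos = bisect_left(longer, x)  # binary search in the longer array
        if pos < len(longer) and longer[pos] == x:
            result.append(x)
    return result

a = [1, 2, 3, 7, 8, 12, 13, 16, 20, 21, 22, 23]
b = [1, 2, 10, 11, 20, 21]
print(intersect_by_binary_search(b, a))  # [1, 2, 20, 21]
```

Each lookup takes on the order of log2(len(longer)) comparisons, so the total work of this plain version is about len(shorter) * log2(len(longer)) no matter how the values are clustered.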

We designed the algorithm for the GPU, and it is quite fast. We find that it runs faster on data with this kind of clustering. Now we want to analyze the algorithm on such data, but we do not know how to go about it.

Is there something like a random process, or anything else, that could help us calculate the expected cost?

score 0 · Accepted Answer

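Walk through the array and take the difference between each pair of neighbours (elements 0,1; 1,2; ...). Count the number of differences equal to 1 and divide by n-1. This gives you the fraction of pairs that are consecutive. It is a crude metric.

Crude metric: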

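values = [1, 2, 3, 4, 5, 8, 9, 10]
consecutive = 0
# Compare each adjacent pair (values[i], values[i + 1]).
# The last valid i is len(values) - 2, so values[i + 1] stays in bounds.
for i in range(len(values) - 1):
    if values[i + 1] - values[i] == 1:
        consecutive += 1
ratio = consecutive / (len(values) - 1)  # 6 of the 7 pairs here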

score 0 · Accepted Answer

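I don't know what you mean by a fully parallel algorithm, but since the arrays are sorted, you can use a sequential algorithm that runs in O(m + n) time, where m and n are the array lengths: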

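// Assumes array1 and array2 are sorted int[] arrays; needs java.util.List and java.util.ArrayList.
List<Integer> intersection = new ArrayList<>();
int i = 0, j = 0;
while (i < array1.length && j < array2.length) {
    if (array1[i] == array2[j]) {
        // Same value in both arrays: record it and advance both indices.
        intersection.add(array1[i]);
        ++i;
        ++j;
    } else if (array1[i] < array2[j]) {
        // array1[i] is too small to match anything left in array2.
        ++i;
    } else {
        // array2[j] is too small to match anything left in array1.
        ++j;
    }
}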

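This assumes the arrays contain unique values. If a value can be repeated, the problem needs to be defined more precisely to determine what constitutes the intersection array.

The algorithm could probably be sped up considerably by doing a binary search instead of simply incrementing i or j when no match is found. This needs a binary search that reports the position at which the element would be inserted when it is not found. (Merely reporting failure would be a waste of time.)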
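One way to read this suggestion is sketched below. It is an interpretation, not code from the answer: the name `intersect_with_skips` is hypothetical, and Python's `bisect_left` is used as the binary search that returns an insertion point. Both inputs are again assumed to be sorted with unique values.

```python
from bisect import bisect_left

def intersect_with_skips(a, b):
    """Merge-style intersection that jumps ahead with binary search."""
    result = []
    i, j = 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Jump i to the first position in a[i+1:] whose value is >= b[j].
            # bisect_left returns that insertion point even when b[j] is absent.
            i = bisect_left(a, b[j], i + 1)
        else:
            # Symmetric case: jump j to the first value in b[j+1:] >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result

a = [1, 2, 3, 7, 8, 12, 13, 16, 20, 21, 22, 23]
b = [1, 2, 10, 11, 20, 21]
print(intersect_with_skips(a, b))  # [1, 2, 20, 21]
```

Because every search starts at the current index rather than at 0, a long run of values that cannot match is skipped in one jump instead of one increment per element.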
