4

I have a large number of sets of numbers. Each set contains 10 numbers, and I need to remove every set that has 5 or more numbers (unordered) in common with any other set.

For example:

set 1: {12,14,222,998,1,89,43,22,7654,23}
set 2: {44,23,64,76,987,3,2345,443,431,88}
set 3: {998,22,7654,345,112,32,89,9842,31,23}

Given the 3 sets of 10 numbers above, sets 1 and 3 would be considered duplicates because they have 5 matching numbers (998, 22, 7654, 89, 23). So in this case I would remove set 3 (as it is considered similar to set 1).

I have more than 10,000 sets to compare, and I want to do this very efficiently. I've been turning it over in my head and I just can't come up with an efficient way to perform this comparison (doing it in a single pass would be great).

Any ideas? Thanks!

Mike


12 Answers

27

You should rethink your requirements because as it is, the operation does not even have a well-defined result. For example, take these sets:

set 1: {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} 
set 2: {6, 7, 8, 9, 10, 11, 12, 13, 14, 15} 
set 3: {11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

If you first consider 1 and 2 to be "duplicates" and eliminate set 1, then 2 and 3 are also "duplicates" and you are left with only one remaining set. But if you instead eliminate set 2 first, then 1 and 3 have no matches and you are left with two sets remaining.
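
To make the order dependence concrete, here is a minimal sketch (Java; the similar and eliminate helpers are mine, purely for illustration, not anything from the question) that runs a greedy keep-the-first-survivor pass over the three sets above in two different orders and gets two different answers:

import java.util.*;

public class OrderMatters {
    // True when two sets share 5 or more numbers (the question's "duplicate" test).
    static boolean similar(Set<Integer> a, Set<Integer> b) {
        Set<Integer> common = new HashSet<>(a);
        common.retainAll(b);
        return common.size() >= 5;
    }

    // Greedy pass: keep a set only if it is not similar to any already-kept set.
    static List<Set<Integer>> eliminate(List<Set<Integer>> sets) {
        List<Set<Integer>> kept = new ArrayList<>();
        for (Set<Integer> s : sets) {
            boolean dup = false;
            for (Set<Integer> k : kept) {
                if (similar(s, k)) { dup = true; break; }
            }
            if (!dup) kept.add(s);
        }
        return kept;
    }

    public static void main(String[] args) {
        Set<Integer> s1 = new HashSet<>(Arrays.asList(1,2,3,4,5,6,7,8,9,10));
        Set<Integer> s2 = new HashSet<>(Arrays.asList(6,7,8,9,10,11,12,13,14,15));
        Set<Integer> s3 = new HashSet<>(Arrays.asList(11,12,13,14,15,16,17,18,19,20));
        // The processing order changes the outcome:
        System.out.println(eliminate(Arrays.asList(s1, s2, s3)).size()); // 2: s1 and s3 survive
        System.out.println(eliminate(Arrays.asList(s2, s1, s3)).size()); // 1: only s2 survives
    }
}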

You can easily expand this to your full 10,000 sets so that it would be possible that depending on which sets you compare and eliminate first, you could be left with only a single set, or with 5,000 sets. I don't think that is what you want.

Mathematically speaking, your problem is that you are trying to find equivalence classes, but the relation "similarity" you use to define them is not an equivalence relation. Specifically, it is not transitive. In layman's terms, if set A is "similar" to set B and set B is "similar" to set C, then your definition does not ensure that A is also "similar" to C, and therefore you cannot meaningfully eliminate similar sets.

You need to first clarify your requirements to deal with this problem before worrying about an efficient implementation. Either find a way to define a transitive similarity, or keep all sets and work only with comparisons (or with a list of similar sets for each single set).

Answered 2009-06-27T23:43:51.443
6

Another perfect job for a Signature Tree. Once again I'm aghast that there isn't a library out there that implements them. Let me know if you write one.

From the abstract of the first paper in the search results above:

We propose a method that represents set data as bitmaps (signatures) and organizes them into a hierarchical index, suitable for similarity search and other related query types. In contrast to a previous technique, the signature tree is dynamic and does not rely on hardwired constants. Experiments with synthetic and real datasets show that it is robust to different data characteristics, scalable to the database size and efficient for various queries.
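
The tree index itself is beyond a quick sketch, but the "sets as bitmaps" part of the abstract is easy to illustrate. A rough sketch in Java, assuming (which the question does not state) that the numbers fall in a bounded range so that one bit per possible value is affordable; with that assumption the cardinality of the AND of two bitmaps is exactly the size of the intersection:

import java.util.BitSet;

public class BitmapSignature {
    // One bit per possible value, so the bitmap is an exact representation of the set.
    static BitSet toBitmap(int[] set, int maxValue) {
        BitSet bits = new BitSet(maxValue + 1);
        for (int v : set) {
            bits.set(v);
        }
        return bits;
    }

    // Number of values present in both sets.
    static int overlap(BitSet a, BitSet b) {
        BitSet and = (BitSet) a.clone();
        and.and(b);
        return and.cardinality();
    }

    public static void main(String[] args) {
        int[] set1 = {12, 14, 222, 998, 1, 89, 43, 22, 7654, 23};
        int[] set3 = {998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23};
        int max = 10000; // assumed bound on the values, not given in the question
        System.out.println(overlap(toBitmap(set1, max), toBitmap(set3, max)) >= 5); // true
    }
}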

Answered 2009-06-27T23:39:23.013
2

I don't think there's a nice and beautiful way to do it. Most of the other answers will have you compare most pairs x, y, which is O(N^2). You can do it faster.

Algorithm: keep an array of all 5-element subsets (stored in a canonical, sorted order, since the match is unordered). For each new set, split it into all possible 5-element subsets and add them to that array. At the end, sort the array and check for duplicates.

There are C(10, 5) = 10*9*8*7*6/120 = 252 subsets of length 5 of a set of length 10. So you're keeping a table roughly 10^3 times larger than your data, but you perform only about 252*N insertions plus one sort. That should work well in practice, and I suspect it's close to the best you can do theoretically as well.
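
A rough sketch of that algorithm in Java (the names and the string encoding of the keys are mine, just for illustration): every set contributes its 252 five-element subsets in canonical sorted order, all keys are sorted, and equal neighbours coming from different sets reveal a 5-number overlap:

import java.util.*;

public class SubsetSortSketch {
    public static void main(String[] args) {
        List<int[]> sets = Arrays.asList(
            new int[] {12, 14, 222, 998, 1, 89, 43, 22, 7654, 23},
            new int[] {44, 23, 64, 76, 987, 3, 2345, 443, 431, 88},
            new int[] {998, 22, 7654, 345, 112, 32, 89, 9842, 31, 23});

        List<String> keys = new ArrayList<>();          // "a,b,c,d,e|setIndex"
        for (int i = 0; i < sets.size(); i++) {
            int[] s = sets.get(i).clone();
            Arrays.sort(s);                             // canonical order: the match is unordered
            int n = s.length;
            for (int a = 0; a < n; a++)
              for (int b = a + 1; b < n; b++)
                for (int c = b + 1; c < n; c++)
                  for (int d = c + 1; d < n; d++)
                    for (int e = d + 1; e < n; e++)
                      keys.add(s[a] + "," + s[b] + "," + s[c] + "," + s[d] + "," + s[e] + "|" + i);
        }
        Collections.sort(keys);                         // groups identical 5-subsets together
        for (int i = 1; i < keys.size(); i++) {
            String prev = keys.get(i - 1), cur = keys.get(i);
            String a = prev.substring(0, prev.indexOf('|'));
            String b = cur.substring(0, cur.indexOf('|'));
            if (a.equals(b)) {                          // same 5 numbers, different sets
                System.out.println("sets " + prev.substring(prev.indexOf('|') + 1)
                        + " and " + cur.substring(cur.indexOf('|') + 1)
                        + " share the 5 numbers " + a);
            }
        }
    }
}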

Answered 2009-06-28T00:01:06.900
2

You don't say much about what range the numbers can fall in, but I have two ideas:

  • An inverted list that maps each number occurring in any list to the lists containing it; then intersect those lists to find the lists that have more than one number in common (a sketch follows this list).

  • Partition the numbers, or group them into ranges of "close" numbers, then narrow down the lists that have numbers occurring in those ranges. You reduce the candidate matching lists to a manageable number and can then compare those lists exactly. I think this would be a "closeness" approach.
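
Here is the promised sketch of the first (inverted-list) idea in Java; the names are illustrative only. Each number maps to the indices of the sets containing it, every pair of sets sharing that number gets its counter bumped, and any pair whose counter reaches 5 is a candidate duplicate:

import java.util.*;

public class InvertedListSketch {
    // Returns every pair of set indices that share at least minShared numbers.
    static List<int[]> overlappingPairs(List<Set<Integer>> sets, int minShared) {
        Map<Integer, List<Integer>> byNumber = new HashMap<>();   // number -> indices of sets containing it
        for (int i = 0; i < sets.size(); i++) {
            for (int value : sets.get(i)) {
                byNumber.computeIfAbsent(value, v -> new ArrayList<>()).add(i);
            }
        }
        Map<Long, Integer> pairCounts = new HashMap<>();          // (i << 32 | j) -> shared count
        for (List<Integer> indices : byNumber.values()) {
            for (int a = 0; a < indices.size(); a++) {
                for (int b = a + 1; b < indices.size(); b++) {
                    long key = ((long) indices.get(a) << 32) | indices.get(b);
                    pairCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        List<int[]> result = new ArrayList<>();
        for (Map.Entry<Long, Integer> e : pairCounts.entrySet()) {
            if (e.getValue() >= minShared) {
                result.add(new int[] {(int) (e.getKey() >>> 32), (int) (long) e.getKey()});
            }
        }
        return result;
    }
}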

Answered 2009-06-27T23:11:34.520
2

There is a way to do this that is time-efficient but extremely space-inefficient.

If my math is right, each combination of 5 numbers from a set of 10 gives 10! / (5! * (10-5)!) = 252 combinations, multiplied by 10,000 sets = 2.52 million combinations. A set of 5 integers consumes about 20 bytes, so you could put every combination for every set into a HashSet and use only about 50 megabytes (plus overhead, which will blow that out by a factor of at least 2-3).

Now that may seem expensive, but if the alternative, when you check a new set of 10 against the existing 10,000 individually, is to calculate 252 sets of 5 and see whether any of them are in the set, then it has to be better.

Basically:

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class SetOf5 {
  private final Set<Integer> numbers;
  private final int hashCode;

  public SetOf5(int... numbers) {
    if (numbers.length != 5) {
      throw new IllegalArgumentException();
    }
    Set<Integer> set = new HashSet<Integer>();
    for (int i : numbers) {
      set.add(i);
    }
    this.numbers = Collections.unmodifiableSet(set);
    // precalculated because it will be called a lot
    this.hashCode = this.numbers.hashCode();
  }

  // other constructors for passing in, say, an array of 5 or a Collection of 5

  @Override
  public int hashCode() {
    return hashCode;
  }

  @Override
  public boolean equals(Object ob) {
    if (!(ob instanceof SetOf5)) return false;
    SetOf5 setOf5 = (SetOf5) ob;
    return numbers.equals(setOf5.numbers);
  }
}

Then you just need to do two things:

  1. Create a HashSet<SetOf5> for all your existing 5-element tuples; and
  2. Create an algorithm to generate all the possible 5-element sets.

Your algorithm then becomes: for each set of 10 numbers, create all possible sets of 5 and check each one to see whether it is already in the HashSet. If it is, reject the set of 10. If it isn't, add its sets of 5 to the "set of sets" and continue.
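
A possible shape for those two steps and the driver loop, reusing the SetOf5 class above; the subset enumeration and the loop are my sketch of what this answer describes, not code from the answer itself:

import java.util.*;

public class SetOf5Driver {
    // All 252 five-element combinations of the ten numbers.
    static List<SetOf5> allFives(int[] ten) {
        List<SetOf5> out = new ArrayList<>();
        for (int a = 0; a < 10; a++)
          for (int b = a + 1; b < 10; b++)
            for (int c = b + 1; c < 10; c++)
              for (int d = c + 1; d < 10; d++)
                for (int e = d + 1; e < 10; e++)
                  out.add(new SetOf5(ten[a], ten[b], ten[c], ten[d], ten[e]));
        return out;
    }

    // Keeps a set of 10 only if none of its 5-subsets has been seen before.
    static List<int[]> deduplicate(List<int[]> setsOfTen) {
        Set<SetOf5> seen = new HashSet<>();
        List<int[]> kept = new ArrayList<>();
        for (int[] ten : setsOfTen) {
            List<SetOf5> fives = allFives(ten);
            boolean duplicate = false;
            for (SetOf5 five : fives) {
                if (seen.contains(five)) { duplicate = true; break; }
            }
            if (!duplicate) {
                seen.addAll(fives);
                kept.add(ten);
            }
        }
        return kept;
    }
}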

I think you'll find this will be considerably cheaper than any brute-force comparison of 10,000 sets against each other, at least for the case of 5 numbers out of sets of 10.

Answered 2009-06-28T00:40:20.550
1

Since you need to compare all pairs of sets, the algorithm is about O(N^2), where N is the number of sets.

For each comparison you can do about O(X+Y) work, where X and Y are the sizes of the two sets, in your case 10 each, so that part is constant. But this requires you to sort all the sets beforehand, which adds O(N*x*lg x), where x*lg x is again constant in your case.

The linear comparison of two sets is fairly simple once the sets are sorted: you can iterate over both sets at the same time. See C++ std::set_intersection for details.
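
For illustration, here is that linear merge written out in Java (std::set_intersection itself is C++; this is just the same idea as a small, self-contained example):

public class SortedIntersection {
    // Counts values common to two ascending sorted arrays in O(X + Y) time.
    static int countCommon(int[] a, int[] b) {
        int i = 0, j = 0, common = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) {
                i++;
            } else if (a[i] > b[j]) {
                j++;
            } else {            // value present in both sets
                common++;
                i++;
                j++;
            }
        }
        return common;
    }

    public static void main(String[] args) {
        int[] s1 = {1, 12, 14, 22, 23, 43, 89, 222, 998, 7654};
        int[] s3 = {22, 23, 31, 32, 89, 112, 345, 998, 7654, 9842};
        System.out.println(countCommon(s1, s3));   // 5, for the question's sets 1 and 3
    }
}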

The entire algorithm is then O(N^2), which would be pretty slow for 10000 sets.

Answered 2009-06-27T23:29:11.707
1

You should find the Pearson Coefficient between two sets of data. This method will make your program easily scalable to huge data sets.

Answered 2009-06-27T23:50:51.617
1

Perhaps you need an algorithm like this (as far as I understand your problem)?

import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Set;

/**
 * @author karnokd, 2009.06.28.
 * @version $Revision 1.0$
 */
public class NoOverlappingSets {
    // because of the shortcomings of java type inference, O(N)
    public static Set<Integer> setOf(Integer... values) {
        return new HashSet<Integer>(Arrays.asList(values));
    }
    // the test function, O(N)
    public static boolean isNumberOfDuplicatesAboveLimit(
            Set<Integer> first, Set<Integer> second, int limit) {
        int result = 0;
        for (Integer i : first) {
            if (second.contains(i)) {
                result++;
                if (result >= limit) {
                    return true;
                }
            }
        }
        return false;
    }
    /**
     * @param args
     */
    public static void main(String[] args) {
        // TODO Auto-generated method stub
        List<Set<Integer>> sets = new LinkedList<Set<Integer>>() {{
            add(setOf(12,14,222,998,1,89,43,22,7654,23));
            add(setOf(44,23,64,76,987,3,2345,443,431,88));
            add(setOf(998,22,7654,345,112,32,89,9842,31,23));
        }};
        List<Set<Integer>> resultset = new LinkedList<Set<Integer>>();
        loop:
        for (Set<Integer> curr : sets) {
            for (Set<Integer> existing : resultset) {
                if (isNumberOfDuplicatesAboveLimit(curr, existing, 5)) {
                    continue loop;
                }
            }
            // no overlapping with the previous instances
            resultset.add(curr);
        }
        System.out.println(resultset);
    }

}

I'm no expert in big-O notation, but I think this algorithm is O(N*M^2), where N is the number of elements per set and M is the total number of sets (based on the number of loops I use in the algorithm). I took the liberty of defining what I consider overlapping sets to be.

I think your problem is polynomial. If I remember my lectures correctly, the decision-based version would be NP-hard, but correct me if I'm wrong.

Answered 2009-06-28T10:59:54.920
1

This is an easy problem because your sets are limited to size ten. For every set of ten numbers you have fewer than 1,000 subsets of the set that contain at least five numbers. Select a hash function that hashes integer sequences into, say, 32-bit numbers. For every set of ten integers, calculate the value of this hash function for every subset with five or more elements. This gives fewer than 1,000 hash values per set of ten numbers. Add a pointer to the set of ten integers to a hash table under all of these (at most 1,000) keys. Once you have done this, your hash table has roughly 1,000 * 10,000 = 10 million entries, which is completely doable; and this first pass is linear (O(n)) because the size of an individual set is bounded by 10.

In the next pass, iterate over all the hash values in any order. Whenever more than one set is associated with the same hash value, it most likely means they contain a common subset of at least five integers. Verify this, then erase one of the sets and its corresponding hash table entries, and continue through the hash table. This is also an O(n) step.

Finally, suppose you are doing this in C. Here is a routine that would calculate the hash values for a single set of ten integers. It is assumed that the integers are in ascending order:

static int hash_index;

/* forward declaration: hrec is called by calculate_hash before it is defined */
void hrec(int *myset, unsigned int *hash_values,
          unsigned int h, int idx, int card);

void calculate_hash(int *myset, unsigned int *hash_values)
{
  hash_index = 0;
  hrec(myset, hash_values, 0, 0, 0);
}

void hrec(int *myset, unsigned int *hash_values,
          unsigned int h, int idx, int card)
{
  if (idx == 10) {
    if (card >= 5) {    /* keep only subsets with 5 or more elements */
      hash_values[hash_index++] = h;
    }
    return;
  }
  /* mix myset[idx] into the hash for the "include this element" branch */
  unsigned int hp = h;
  hp += (myset[idx]) + 0xf0f0f0f0;
  hp += (hp << 13) | (hp >> 19);
  hp *= 0x7777;
  hp += (hp << 13) | (hp >> 19);
  hrec(myset, hash_values, hp, idx + 1, card + 1);  /* include myset[idx] */
  hrec(myset, hash_values, h,  idx + 1, card);      /* exclude myset[idx] */
}

This runs through all 1024 subsets and stores the hash values of the subsets with cardinality 5 or more in the hash_values array. At the end, hash_index counts the number of valid entries. It is of course a constant, but I did not compute its numeric value here.

Answered 2009-06-28T23:34:59.467
0

We will take the data set, adorn each element with a signature, and sort it. The signature has the property that sorting groups together those elements which could possibly have duplicates. When comparing data_set[j] to the items in data_set[j+1 ...], we can advance to the next j as soon as the first signature in [j+1 ...] fails the duplicate check. This "adjacency criterion" assures us we don't have to look further; no element beyond that one can be a duplicate.

This reduces the O(N^2) comparisons a great deal. How much, I'll let an algorithm analyst decide, but the code below does roughly 400k comparisons instead of the 100m of a naive O(N^2).

The signature starts by bucketing the elements. We divide the range of the numbers into N equal-sized buckets: 1..k, k+1..2k, 2k+1..3k, ... While iterating over the elements, we increment the count of the bucket each number falls into. This yields an initial signature of the form (0,0,0,1,3,0,0,...4,2).

The signature has the property that if

sum(min(sig_a[i], sig_b[i]) for i in range(10)) >= 5

then the elements associated with those signatures may have at least 5 duplicates. But more importantly, if the above does not hold, the elements cannot have 5 duplicates. Let's call this the "signature match criterion".

Sorting by the signature above does not, however, give the adjacency property described earlier. But if we modify the signature to the two-element form:

(sum(sig[:-1]), sig[-1])

then the "signature match criterion" holds. But does the adjacency criterion hold? Yes. The sum of this signature is 10. If we enumerate, we have the following possible signatures:

(0,10) (1, 9) (2, 8) (3, 7) (4, 6) (5, 5) (6, 4) (7, 3) (8, 2) (9, 1) (10,0)

If we compare (0,10) against (1,9) .. (10,0), we note that once the signature test fails, it never becomes true again. The adjacency criterion holds. Furthermore, that adjacency criterion holds for all positive values, not just "5".

OK, that's nice, but collapsing the signature into two large buckets won't necessarily reduce the O(N^2) search; the signature is too general. We solve that problem by creating a signature for sig[:-1], producing

(sum(sig[:-1]), sig[-1]), (sum(sig[:-2]), sig[-2]), ...

and so on. I believe this signature still satisfies adjacency, but I could be wrong.

There are some optimizations I have not done: the signature only needs the last value of each tuple, not the first, but the sort step would have to be revised. Also, the signature comparison could be optimized with an early failure once it becomes clear that further scanning cannot succeed.

# python 3.0
import random

# M number of elements, N size of each element
M = 10000
N = 10

# Bounds on the size of an element of each set
Vmin,Vmax = 0, (1 << 12)

# DupCount is number of identical numbers required for a duplicate
DupCount = 5

# R random number generator, same sequence each time through
R = random.Random()
R.seed(42)

# Create a data set of roughly the correct size
data_set = [list(s) for s in (set(R.randint(Vmin, Vmax) for n in range(N)) for m in range(M)) if len(s) == N]

# Adorn the data_set signatures and sort
def signature(element, width, n):
"Return a signature for the element"
    def pearl(l, s):
        def accrete(l, s, last, out):
            if last == 0:
                return out
            r = l[last]
            return accrete(l, s-r, last-1, out+[(s-r,r)])
        return accrete(l, s, len(l)-1, [])
    l = (n+1) * [0]
    for i in element:
        l[i // width] += 1
    return pearl(l, len(element))

# O(n lg(n)) - with only 10k elements, lg(n) is a little over 13
adorned_data_set = sorted([signature(element, (Vmax-Vmin+1)//12, 12), element] for element in data_set)

# Count the number of possible intersections
def compare_signatures(sig_a, sig_b, n=DupCount):
    "Return true if the signatures are compatible"
    for ((head_a, tail_a), (head_b, tail_b)) in zip(sig_a, sig_b):
        n -= min(tail_a, tail_b)
        if n <= 0:
            return True
    return False

k = n = 0
for i, (sig_a, element_a) in enumerate(adorned_data_set):
    if not element_a:
        continue
    for j in range(i+1, len(adorned_data_set)):
        sig_b, element_b = adorned_data_set[j]
        if not element_b:
            continue
        k += 1
        if compare_signatures(sig_a, sig_b):
            # here element_a and element_b would be compared for equality
            # and the duplicate removed by  adorned_data_set[j][1] = []
            n += 1
        else:
            break

print("maximum of %d out of %d comparisons required" % (n,k))

Answered 2009-07-01T07:32:16.210
0

Let's assume you have a class NumberSet which implements your unordered set (and can enumerate its ints to get the numbers). You then need the following data structures and algorithm:

  • Map<int, Set<NumberSet>> numberSets
  • Map<Pair<NumberSet, NumberSet>, int> matchCount
  • Pair<X,Y> is a key object which returns the same hashcode and equality for each instance with the same X and Y (no matter if they are swapped)

Now for each set to be added/compared do the following (pseudocode!!!):

for (int number: setToAdd) {
   Set<NumberSet> numbers = numberSets.get(number);
   if (numbers == null) {
      numbers = new HashSet<NumberSet>();
      numberSets.put(number, numbers);
   } else {
      for (NumberSet numberSet: numbers) {
         Pair<NumberSet, NumberSet> pairKey = new Pair<NumberSet, NumberSet>(numberSet, setToAdd);
         matchCount.put(pairKey, matchCount.get(pairKey)+1); // make sure to handle null as 0 here in real code ;)
      }
   }
   numbers.add(setToAdd); // register this set under the number so later sets can find it
}

At any time you can go through the pairs; any pair whose count is 5 or greater indicates a duplicate.

Note: removing the sets may be a bad idea, because if A is considered a duplicate of B, and B a duplicate of C, C doesn't have to be a duplicate of A. So if you remove B, you'd not remove C, and the order in which you add your sets would become important.

Answered 2009-06-27T23:44:42.283
-2

It sounds like you want to use the HashSet class. That should give you O(1) lookup time, which should make for very efficient comparisons if you loop correctly. (I'm not discussing the algorithm here, just suggesting a data structure in case it helps.)

Answered 2009-06-27T23:15:35.667