python - 查找相距很远的最大大小的字符串集

Question

给定一组相同长度的二进制字符串 S，找到 S 的最大大小子集 S' 的快速方法是什么，其属性是 S' 中每对字符串之间的汉明距离至少为 d？

以下函数计算两个字符串之间的汉明距离。

def hamdist(str1, str2):
    """Count the # of differences between equal length strings str1 and str2"""
    if (len(str1) != len(str2)):
        print str1, str2, "Length mismatch!"
        quit()
    diffs = 0
    for ch1, ch2 in itertools.izip(str1, str2):
        if ch1 != ch2:
            diffs += 1
    return diffs

score 3 · Accepted Answer

扩展 Egor 的评论：

想象一下，您有一个图G，其中的每个字符串都有一个顶点S。现在，对于每对顶点v1，v2求对应字符串之间的汉明距离s1，s2。如果大于，则在和之间d添加一条边。Gv1v2

现在，您想找到这样的最大子集，S使得该最大子集中的每一对字符串至少d在它们之间具有汉明距离。我们刚刚构建的图上的相应问题是找到最大的顶点集合，使得该集合中的每个顶点都连接到该集合中的每个其他顶点。

这是最大集团问题。如果您单击该链接到维基百科文章，您会发现解决此问题的最知名算法及时运行O(1.2599^n)，这是指数级的，因此速度不快。如果您可以快速解决您的问题（也就是说，在多项式时间内），那么您可以使用此对应关系在多项式时间内解决最大集团问题，因此您的问题没有快速解决方案。

score 2 · Accepted Answer

As Falk notes, in order to prove that this problem is NP-hard, we need a reduction from an NP-hard problem. I'm going to use the problem of finding an independent set (i.e., finding a clique in the complement graph).

My reduction has two stages. The output of the first stage is a generalized instance with a non-binary alphabet. The output of the second stage uses a binary alphabet. Let G = (V, E) be the graph in which we are trying to find a large independent set. The output of the first stage is |V| words of length |E|. Let e = (v, w) be the i^th edge. The letters in the i^th position of each word are all different except for the words corresponding to v and w, which share a letter there. The size of the alphabet is thus |V| - 1. Two vertices are non-adjacent if and only if their words are at maximum distance, so we set the distance threshold to |V|.

In the second stage, we replace each letter by one of the |V| - 1 words of length |V| - 1 that has 1 "1" and (|V| - 2) "0"s and double the distance threshold to 2 |V|.

To solve small instances, I would recommend using Sam's reduction to the maximum clique problem and running the exponential-time Bron–Kerbosch algorithm to enumerate all maximal cliques, from which the maximum can be selected. (Why B–K and not the faster algorithms? The latter are more complicated and won't extend your reach very far.)

python - 查找相距很远的最大大小的字符串集

2 回答 2

Related

Reference