c - Optimal string containing all strings from a given set as substrings

Question

I'm working with some code that needs a large (constant) bit array. Since it contains large constant spans (all 0's or all 1's) I've broken it into a two-level table that allows duplicate spans (constant or otherwise) to be elided, as in:

bitn = table2[table1[n/256]+n%256/8]&1<<n%8

At this point, entries of table1 are all multiples of 32 (256 bits), but I wonder if significant savings could be achieved by allowing the spans in table2 to overlap. So my question is (stated in the abstract form):

Given N strings { S_n : n=1..N } each of length K, is there an efficient way to find the shortest length string S such that each S_n is a substring of S?

(Note that since I probably want to keep my bit arrays 8-bit aligned, my particular application of the problem will probably be dealing with strings of 8-bit bytes rather than strings of bits, but the problem makes sense on any sense of character - bit, byte, or whatever.)

score 1 · Accepted Answer

首先，这个问题可以表述为一个 TSP。我们有一组节点（每个字符串都是一个节点），我们需要找到访问所有节点的路径。字符串 x 和 y 之间的距离定义为 len(xy)+len(y) 其中 xy 是同时具有 x 和 y 且以 x 开头的最优字符串（例如 x=000111, y=011100, xy=0001100 , 距离(x,y) = 8-6=2)。

请注意，这也遵守三角不等式（distance(x,z) <= distance(x,y)+ distance(y,z) ）。距离是从 1 到 k 的整数。此外，距离是不对称的。

此版本的 TSP 称为 (1,B)-ATSP。有关此类问题的分析和近似解决方案，请参见http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.3439。

score 1 · Accepted Answer

关于具有大段常量的大型常量位数组，这里有另一种设计表格的方法供您考虑（我不知道您的确切需求，所以我不能说它是否有帮助）。

考虑类似基数树的东西。为了便于解释，让我定义 get 函数：

#define TYP_CONST
#define TYP_ARRAY

struct node {
    unsigned min;
    unsigned max;
    int typ;
    union {
        char *bits;
        int constant;
    } d;
    struct node *left;
    struct node *right;
}

struct bit_array {
    unsigned length;
    struct node *root;
}

int get(struct bit_array *b, unsigned ix)
{
    struct node *n = b->root;
    if (ix >= b->length)
        return -1;
    while (n) {
        if (ix > n->max) {
            n = n->right;
            continue;
        } else if (ix < n->min) {
            n = n->left;
            continue;
        }
        if (n->typ == TYP_CONST)
            return n->d.constant;
        ix -= n->min;
        return !!(n->d.bits[ix/8] & (1 << ix%8));
    }
    return -1;
}

用人类的话来说，你想在树中搜索你的比特。每个节点负责一定范围的位，并且您通过范围进行二进制搜索以找到您想要的范围。

找到范围后，有两个选项：常量或数组。如果常量，只需返回常量（为您节省大量内存）。如果是数组，则在位数组中进行数组查找。

您将有 O(log n) 查找时间而不是 O(1).... 尽管它仍然应该非常快。

这里的困难在于设置适当的数据结构很烦人并且容易出错。但是你说数组是恒定的，所以这可能不是问题。

c - Optimal string containing all strings from a given set as substrings

2 回答 2

Related

Reference