algorithm - 字符串分析

Question

给定一系列操作：

a*b*a*b*a*a*b*a*b

有没有办法获得最佳细分以启用子字符串的重用。

制造

a*b*a*b*a*a*b*a*b => c*a*c，其中 c = a*b*a*b

然后看到

a*b*a*b => d*d，其中 d = a*b

总而言之，将 8 个初始操作减少到此处描述的 4 个？

(c = (d = a*b)*d)*a*c

目标当然是最小化操作次数

我正在考虑一种后缀树。

我对线性时间启发式或解决方案特别感兴趣。'*' 操作实际上是矩阵乘法。

score 19 · Accepted Answer

这整个问题被称为“通用子表达式消除”或 CSE。它是函数式编程语言的编译器实现者面临的称为“图形缩减”问题的一个稍小的版本。谷歌搜索“通用子表达式消除算法”提供了很多解决方案，但我没有看到特别是对于矩阵乘法给出的约束。

链接的页面提供了很多参考。

我的旧答案如下。但是，经过更多研究，解决方案只是构建一个suffix tree。这可以在 O(N) 时间内完成（维基百科页面上有很多参考资料）。完成此操作后，子表达式（您的问题中的 c、d 等）只是后缀树中的节点 - 只需将它们拉出即可。

但是，我认为 MarcoS 提出了Longest repeating Substring的建议，因为图形缩减优先级可能不允许此处允许的优化。

算法草图：

optimise(s):
    let sub = longestRepeatingSubstring(s).
    optimisedSub = optimise(sub)
    return s with sub replaced by optimisedSub

最长重复子串的每次运行都需要时间 N。您可能可以重新使用您构建的后缀树在时间 N 内解决整个问题。

score 14 · Accepted Answer

编辑：除了接受的答案之外，还需要此答案中的增长顺序才能运行 CSE 或矩阵链乘法

有趣的是，压缩算法可能是您想要的：压缩算法旨在减小其压缩内容的大小，如果它唯一能做到这一点的方法是替换，您可以跟踪它并获得算法所需的子组件。尽管对于小输入，这可能不会给出很好的结果。

在选择这种算法时，哪些操作子集是可交换的将是一个重要的考虑因素。[编辑：OP说在他/她的情况下没有操作是可交换的]

如果我们忽略缓存等影响，我们还可以定义一个最佳解决方案：

input: [some product of matrices to compute]

given that multiplying two NxN matrices is O(N^2.376)
given we can visualize the product as follows:
    [[AxB][BxC][CxD][DxE]...]
we must for example perform O(max(A,B,C)^2.376) or so operations in order to combine
    [AxB][BxC] -> [AxC]

The max(...) is an estimate based on how fast it is to multiply two square matrices;
a better estimate of cost(A,B,C) for multiplying an AxB * BxC matrix can be gotten 
from actually looking at the algorithm, or running benchmarks if you don't know the
algorithm used.

However note that multiplying the same matrix with itself, i.e. calculating
a power, can be much more efficient, and we also need to take that into account.
At worst, it takes log_2(power) multiplies each of O(N^2.376), but this could be
made more efficient by diagonalizing the matrix first.

关于贪婪方法是否可行的问题是：是否应该在每一步压缩重复的子字符串。情况可能并非如此，例如

aaaaabaab
compressing 'aa' results in ccabcb and compressing 'aab' is now impossible

但是我有一种预感，如果我们尝试所有压缩子字符串的顺序，我们可能不会经常遇到这个问题。

因此，在写下我们想要的（成本）并考虑了可能存在的问题之后，我们已经有了一个可以做到这一点的蛮力算法，并且它将运行在非常少量的矩阵上：

# pseudocode

def compress(problem, substring)
    x = new Problem(problem)
    x.string.replaceall(substring, newsymbol)
    x.subcomputations += Subcomputation(newsymbol=substring)

def bestCompression(problem)
    candidateCompressions = [compress(problem,substring) for each substring in problem.string]
    # etc., recursively return problem with minimum cost
    # dynamic programming may help make this more efficient, but one must watch
    #   out for the note above, how it may be hard to be greedy

注意：根据 Asgeir 的另一个回答，这被称为矩阵链乘法优化问题。Nick Fortescue 指出，这也更普遍地称为http://en.wikipedia.org/wiki/Common_subexpression_elimination - 因此可以从文献中找到任何通用 CSE 或矩阵链乘法算法/库，并插入成本我之前提到的数量级（您将需要那些知道您使用哪种解决方案的人）。请注意，上述计算（乘法、求幂等）的成本假设它们是使用最先进的算法有效完成的；如果不是这种情况，请将指数替换为与执行操作的方式相对应的适当值。

score 9 · Accepted Answer

如果你想使用最少的算术运算，那么你应该看看矩阵链乘法，它可以减少到 O(n log n)

score 8 · Accepted Answer

从头顶上看，对我来说，问题似乎在 NP 中。根据您正在执行的替换，其他替换将是可能的或不可能的，例如对于字符串 d*e*a*b*c*d*e*a*b*c*d*e*a有几种可能性。

如果您采用最长的公共字符串，它将是： f = d*e*a*b*c您可以替换f*f*e*a为最后三个乘法和四个中间乘法（总共七个）。

如果您改为使用以下方式替换： f = d*e*a您f*b*c*f*b*c*f可以使用g = f*b*cto 进一步替换g*g*f，总共六次乘法。

在这个问题中还有其他可能的替换，但我现在没有时间计算它们。

我猜测对于一个完整的最小替换，不仅需要计算出最长的公共子串，而且还要计算出每个子串重复的次数，这可能意味着你必须跟踪到目前为止的所有替换并进行回溯。它仍然可能比实际的乘法更快。

score 7 · Accepted Answer

7

这不是最长的重复子串问题吗？

于 2011-05-12T09:31:50.650 回答

algorithm - 字符串分析

5 回答 5

Related

Reference