c - Graph Paths Abstraction Algorithm Needed

Question

I have a data structure holding a graph like the one in the following picture: Tree Data Structure

In this tree, a node can have any number of unique children from the levels below it. In tree in the picture represents a set of paths. Where every path should begin with a node from Level 1, and ends with a node of "*" mark. So the paths of the tree in the picture are:

A then C then G

A then C then G then J

A then D then G

A then D then G the J

A then D then K, and so on...

Actually my original tree is huge (around 2 Million sequences) and the maximum number of nodes per level is 61 (of 11 levels). So it causes many memory consumption problems in my application (a computer vision application for SAMSUNG).

My target is to have an iterative algorithm that represents these paths in a more compact string format. So I think we the problem is divided into three steps as follows. I have built the tree data structure (step 2), but still can not derive an iterative algorithm that gets the output string/sequence in step 3 from the tree.

1- Input String:

(A C G) | (A C G J) | (A D G) | (A D G J ) | (A D K) | ....,

Where "|" represents alternatives.

2- Building Tree Data Structure of These Paths.

3- Required Output String:

(A (C G [J]) | (D (G [J]) | K)) | (B ....).

Where where "|" represents alternatives and "[ ]" encloses options. The target output string should be optimized like there are not more common factors that can be taken to more simplify it.

score 2 · Accepted Answer

您可以使用迭代 DFS 的修改，它利用堆栈来跟踪未处理的节点。该算法从不会在堆栈上存储超过 6 个字符^*对于任何一个节点，并且堆栈上的节点总是少于 N 个（其中 N 是图中的节点数）。您已经指出 N 最多为 61*11=671，因此堆栈上最多可能有大约 4000 个元素。

在下面的伪代码中，“目标”节点是上面示例中的星号节点，例如 G ^*。

初始化：

虚拟节点 Φ 被引入，其边缘从 Φ 到每个“根”节点，例如上面的节点 A 和 B。Φ 的标记被假定为非打印字符，或者您可以在将其添加到输出字符串之前显式检查。在调用函数之前，节点 Φ 被压入堆栈。

outString := ""
while stack not empty
   pop token
   if token is node
      outString := outString + node(token)  // Line 5 - explanation below
      if node(token) has children
         if node(token) is destination
            outString := outString + "["
            push "]"
         end
         if node(token) has multiple children
            for each child of node(token), from right to left
               push ")"
               push child
               push "("
               push "|"
            end
            pop // remove last "|"
         else
            push child
         end
      end

   else // token is ()[]|
      outString := outString + token
   end
end

该算法对图的第一部分（A 及其子项）的输出是（为清楚起见添加了额外的空格；可以轻松地将空格添加到代码中）：

A (C G [J]) | (D (G [J]) | (K))

您会注意到您的结果与我的结果之间存在偏差：在我的解决方案中，最终节点 K 括在括号中。A[(B)|(C)]如果这是不可取的（它可能导致像只需将上面的第 5 行替换为：

if (node(token) has no children
    AND last character of outString is "("
    AND next token on stack is ")")
      remove trailing "(" from outString
      concatenate token to outString
      pop ")" from stack and ignore
else
   outString := outString + node(token) // as above
end

如果您有任何问题或我遗漏了什么，请告诉我。

^*这将发生在（可能极不可能）将节点写为|[(A)]. 大多数节点将在堆栈中占用 4 个或更少的字符。

score 1 · Accepted Answer

请注意，这不是一个完整的答案 - 我无法将图像和所有内容放在评论中。

鉴于此图：

解决方案 1

(A (B | C) E*) | (A B D*) | (A C F*)

解决方案 2

(A B (D* | E*)) | (A C (E* | F*))

解决方案 3

(A (B (D* | E*) | C (E* | F*)))

问题

您想如何解决这种歧义？您必须准确地将您要寻找的内容定义为“答案”。