c++ - 在 C++ 中表示多遍抽象语法树 (AST)？

Question

我目前正在探索设计一个在多个阶段转换其 AST 的编译器。这个想法是，从解析树开始，每次传递都会转换树，直到生成的 AST 被优化并包含生成中间代码（在本例中为LLVM IR）所需的树的每个节点中的所有必需信息。遍历树可能会显着改变其结构，例如通过运算符优先级解析将运算符和操作数列表更改为有序操作的层次结构。请注意，通过可能会使结构的某些部分完全不变。

所以，我的问题是我如何最好地（阅读：最容易，尽可能少的重复）表示在 C++ 中具有多个中间表示的 AST？我希望每个阶段的 AST 版本中的节点类型在编译时尊重它们的不兼容性。我认为关键问题是我应该如何表示结构中在传递之间不发生变化的部分，同时避免重复代码？我想这是编译器作者过去多次解决的问题。

请注意，我目前在我的 AST 中使用Boost Variant而不是正常的运行时多态性，并且希望有一个与之兼容的解决方案。

score 4 · Accepted Answer

AST nodes by themselves don't need huge amounts of complexity. I think all this AST node machinery is just overkill.

The problem with ASTs isn't node type safety; its tree shape safety. An AST represents (presumably) some valid instance of some language L. What you ideally want is for transformations on the AST to produce other valid ASTs (instances of the language L). You're not going to guarantee that by guaranteeing that any one node has a valid type; you can only do it by guaranteeing that any tree patch produces a valid tree. And this is very difficult to do if the tree operations are atomic (e.g., "change node that", "replace child", "replace parent") and applied seperately; after several such steps, what precisely can you say about the tree?

This is better done using a kind of tree-rewrite transaction, e.g., source-to-source transformations whose grammatical structure is valid for language L, and which are applied in places that are valid for that transformation.

Most standard program transformation systems do this. They achieve this by holding a model of grammar for L, and checking that the proposed transforms are well-typed. This ensures that transformations of language L to language L stay well-formed.

This is harder to get right if the transformations map from one language A to another language B; if some such transformations are applied, you usually get a tree with mixed types that is not legal in either language. With care, one can define a set of transforms that map all subtrees of language A to language B, and apply them exhaustively; then you want the resulting tree to be well formed for B. You can ensure that by insisting whenever a B-patch is inserted in a mixed tree, if it is adjacent to another B-patch, that the resulting compound B-patch is well formed. This you can do using the same style of grammar checks.

Using these ideas, you can build a system that maps an AST through a series of "representations" (langauges A, B, C, ....) and have some faith that the result tree is well-shaped. This idea generalizes to graph rewrites.

score 3 · Accepted Answer

这是基于类型安全boost::variant的 AST 的快速测试。

我包含了一个简单的“结构保持转换”，它只是改变了存储在每个 AST 节点中的数据类型。然而，理论上，您可以编写一个任意的astFunc，既可以对节点进行结构转换，也可以基于数据的转换——只需编写一个type_list包含每个节点之前和之后的有效类型的 a。

template<typename... Ts>
struct type_list {};

// specialize data_type to store something special in your AST node:
// (by default, an entry means "the type of the data")
tempalte<typename T>
struct data_type { typedef T type; };
template<typename T>
using DataType = typename data_type<T>::type;

template<template<typename>class F, typename typelist>
struct map_types;
template<template<typename>class F, template<typename...>L, typename... Ts>
struct map_types<F, L<Ts...>> {
  typedef L< F<Ts>... > type;
};
template<template<typename>class F, typename typelist>
using MapTypes = typename map_types<F, typelist>::type;

template<template<typename...>class F, typename typelist>
struct apply_list;
template<template<typename...>class F, template<typename...>class L, typename... Ts>
struct apply_list<F, L<Ts...>> {
  typedef F<Ts...> type;
};
template<template<typename...>class F, typename typelist>
using ApplyList = typename apply_list<F, typelist>::type;
template<typename typelist>
using Var = ApplyList< boost::variant, MapTypes<DataType, typelist> >;

template<typename type_list>
struct AST_Node {
  typedef std::unique_ptr<AST_Node> upAST_Node;
  std::vector<upAST_Node> children;
  Var<type_list> data;
  template<typename T>
  AST_Node( T&& t ):data( std::forward<T>(t) ) {}
};
template<typename type_list>
using upAST_Node = typename AST_Node<type_list>::upAST_Node;

template<typename before_types, typename after_types>
using typeFunc = std::function< Var<after_types>(Var<before_types>) >;

template<typename before_types, typename after_types>
using astFunc = std::function< upAST_Node<after_types>(upAST_Node<before_types>) >;

template<typename before_types, typename after_types>
astFunc<before_types, after_types> elementWiseTransform( typeFunc<before_types, after_types> func ) {
  return [func]( upAST_Node<before_types> before )->upAST_Nodes<after_types> {
    upAST_Node<after_types> after( new AST_Node<after_types>( func( before ) ) );
    after->children.reserve( before->children.size() );
    for( auto& child: before->children ) {
      after->children.push_back( elementWiseTransform(func)(std::move(child)) );
    }
    return after;
  };
}

现在这只是一个开始。

您可以更进一步，让每种类型的节点都有一组不同类型的子节点，甚至不同的数量。只需为每种类型的节点创建特征类，例如 my data_type，例如children_types. 然后使用与我定义的方式类似的技术来Var定义您孩子的类型。基本上，您通过variant. 哎呀，你可以将孩子和一起捆绑成一个变体。std::vector< AST_Node<ChildType<type_list_element>>>MapTypesstd::vectordata

这将允许您为单个AST_Node类型（生成另一种AST_Node类型）编写映射，将它们聚合在一起并生成一个AST_Node<before, after>仿函数，然后遍历树。一些仿函数只对数据进行操作，然后让父逻辑接管孩子，一些会转换整个子树，一些会对数据进行操作并阻止父逻辑运行在孩子身上。

该技术变得棘手，因为您必须以不需要将它们全部堆积在一起的方式从您的各个函数中合成提升变体访问者。如果你看这里，你会看到一些关于如何获取一堆std::function<T(U)>并将它们变成一个函子的技术，该函子接受U. 投入一些工作来计算返回类型的联合（一个简单type_list的删除重复类型，然后泵入 a boost::variant，可能会这样做）——这样的“合并函子”将是一个有效的访问者。

现在您可以编写“重新映射 operator_add 类型的 AST 节点”函子，以及“重新映射 operator_mult 类型的 AST 节点”以及其他一些函子，将它们绑定到一个巨型函子中，将它们扔给 AST 遍历算法，然后让它喷出一棵 AST 树，其中某些类型转换为其他类型...

但这将是很多工作。

哦，我们可能想要“阶段标记”，其中阶段 1 和阶段 2 AST 是不同的类型。我们可以用它的阶段标记每个类型type_list，或者我们可以只标记AST树本身。哎呀，我们可以为AST使用其他未使用struct的 s 命名阶段，并将通过阶段的进程定义为类型到类型的函子，该函子在的签名中应用和强制执行astFunc<before_phase, before_types, after_phase, after_types>。

所以这还不错。我们创建一个type_list节点类型。这些类型不必是实际存储的数据。但它可以。

我们创建一个data_type特征类，将每个节点类型映射到存储的数据。我们创建一个child_types特征类，将每个节点类型映射到子 AST 的 type_list。

每个AST_Node存储一个variant<AST_Concrete_Node<Ts>...>. AST_Concrete_Node包含 aDataType<T> data;和 a MapTypes< std::vector, MapTypes< AST_Node, ChildTypes<T> > > children; （又名, std::vector< AST_Node<ChildrenTypes...> >，但你不能直接这么说）。

接下来，AST_Concrete_Node<T>转换功能在模板元编程的一个棘手位中结合在一起，以提升变体访问者。这一步真的很棘手，但我认为可行。完成了额外的工作以跳过未提及的类型，因此我们不必经常说“哦，我们不想转换节点 X”，而是必须说“如果我们命中节点 Y，不要改变它的孩子”。

在这一点上，我想说我在胡说八道——以前没有这样做过，在具体实现这种类型的体操时遇到的问题将压倒我抽象推理的能力。但是这个想法很有用——我们有类型安全的节点类型转换，我们将它们聚合在一起并生成一个类型安全的树转换。这棵树不仅仅是一棵通用变体的抽象树，而是一棵树，其中每个节点都知道其子节点允许的类型是什么，并且递归地知道相同的类型。如果我们在兔子洞里走得足够远，我们甚至可以处理“这必须正好有 3 个孩子，第一个是 an int，第二个是 a Bob，第三个是 a ”。double

c++ - 在 C++ 中表示多遍抽象语法树 (AST)？

2 回答 2

Related

Reference