c - 我在 C 中构建决策树时如何存储它？

Question

这里的概念问题。

我正在递归地构建决策树。函数的每次迭代都采用训练样本的一个子集，遍历所有特征和每个特征内所有可能的拆分，找到可能的最佳拆分，将子集拆分为两个较小的子集，然后调用自身（函数）两次，每次调用一次分裂子集。

我之前在 MatLab 中对此进行了编码，但它运行得太慢了，所以现在我在 C 中尝试它（我不太熟悉）。在 MatLab 中，我使用了一个全局“拆分”矩阵来保存每个拆分的信息（哪个特征，该特征中的什么值，如果这是一个叶节点，分类是什么，每个子节点的行号），并且这样我就可以使用新的测试数据点遍历该矩阵以找到其分类。

看起来 C 中的全局二维数组可以使用头文件，但如果有另一种方法，我宁愿不进入头文件。问题是，因为函数是递归调用的，所以我很难知道“拆分”中的下一个可用行是什么。我可以做一些事情，比如孩子的行是父行的 2*i 和 2*i+1，但是对于有很多拆分的大型数组，这将需要大量的初始存储。

有任何想法吗？

score 1 · Accepted Answer

在我看来，您必须放弃 2D 数组来表示您的树。C 中任意程度的树通常如下所示：

struct node
{       struct node ** children;
        int num_children;
        /* Values in the node/leafs */
};

如果树的度数是固定的，或者对于每个节点都有一个最大值，那么下面会做

struct node
{       struct node * children;
        int num_children; /* If degree has only a maximum */
        /* Values in the node/leafs */
};

您将不得不使用malloc和朋友为节点及其子节点分配内存。

关于头文件：头文件是一种祝福（在 C 中是这样），而不是一种诅咒，但如果你坚持不这样做，那么它们总是可以替换它们的#include实例。

如果您要从 MatLab 转向另一种语言以加快实现速度，那么您可能需要首先考虑 C 之外的其他语言。Java、Python 甚至 Haskell 之类的语言可能会为您提供类似的加速，但对于所有指针来说都没有那么麻烦。

score 0 · Accepted Answer

在 C 中使用这种函数式设计并不漂亮，因为无法保证递归调用将被优化为循环，并且没有匿名函数。至少 C++ 中有 lambda；我建议 C++ 更适合于此，尽管 AFAIK 仍然不能保证 C++ 中的优化。

为了避免可能导致堆栈增长的递归，每个分支都需要返回它选择的下一个分支。调用者（main）然后循环返回值，并在返回值是我们的终端值时终止循环。我们将分支类型定义为函数指针：

typedef void branch();

然后我们声明每个实际分支返回一个分支类型：

branch initial(void) {
    /* do initial processing */
    srand(time(NULL));
    int x = rand() % 2;
    return x == 0 ? left : right;
}

branch terminal(void) {
    /* This should never be called */
    assert(0);
    return NULL;
}

branch left(void) {
    /* do left processing */
    return terminal; /* return a terminal branch to indicate no further
                      * processing */
}

branch right(void) {
    int x;
    /* do right processing, storing either a 0 in x to indicate right_left
     * as the selected branch, or 1 in x to indicate right_right...
     */
    return x == 0 ? right_left : right_right;
}

branch right_left(void) {
    /* do right_left processing */
    return initial; /* return initial to repeat that branch */
}

branch right_right(void) {
    /* do right_right processing; */
    return right; /* return right to repeat that branch */
}

...并循环返回值将如下所示：

int main(void) {
    branch *(b)(void) = initial;
    while (b != terminal) {
        b = b();
    }
}

c - 我在 C 中构建决策树时如何存储它？

2 回答 2

Related

Reference