performance - Performance considerations of Haskell FFI / C?

Question

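If I use Haskell as a library that is called from my C program, what is the performance impact of making calls into it? For instance, suppose I have a problem-world data set of, say, 20kB of data, and I want to run something like: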

// Go through my 1000 actors and have them make a decision based on
// HaskellCode() function, which is compiled Haskell I'm accessing through
// the FFI.  As an argument, send in the SAME 20kB of data to EACH of these
// function calls, and some actor specific data
// The 20kB constant data defines the environment and the actor specific
// data could be their personality or state
for(i = 0; i < 1000; i++)
   actor[i].decision = HaskellCode(20kB of data here, actor[i].personality);

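What is going to happen here: can I keep that 20kB of data as a global immutable reference somewhere the Haskell code can access, or must I create a copy of that data on every call?

The concern is that this data could be larger, much larger. I also hope to write algorithms that act on much larger data sets, using the same pattern of immutable data shared by many calls into the Haskell code.

Also, I would like to parallelize this, with something like GCD's dispatch_apply() or C#'s Parallel.ForEach(..). My rationale for parallelizing outside of Haskell is that I know I will always be operating on many separate function calls (i.e. 1000 actors), so fine-grained parallelization inside the Haskell function is no better than managing it at the C level. Is running Haskell instances through the FFI "thread safe", and how do I achieve this: do I need to initialize a Haskell instance every time I kick off a parallel run? (That seems slow if I must.) How do I achieve this with good performance?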

score 20 · Accepted Answer

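what is the performance impact of making calls into it

Assuming you start the Haskell runtime only once, on my machine a function call from C into Haskell, passing an Int back and forth across the boundary, takes about 80,000 cycles (31,000 ns on my Core 2), as determined experimentally with rdtsc.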

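is it possible for me to keep that 20kB of data as a global immutable reference somewhere that the Haskell code can access

Yes, that is certainly possible. If the data really is immutable, then you get the same result whether you:

thread the data back and forth across the language boundary by marshalling;
pass a reference to the data back and forth;
or cache it in an IORef on the Haskell side.

Which strategy is best? It depends on the data type. The most idiomatic way is to pass a reference to the C data back and forth, treating it as a ByteString or Vector on the Haskell side.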
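As a rough illustration of the "pass a reference" option, here is a minimal sketch, not the answerer's own code. It assumes the world data is a flat byte buffer and that a decision is a single integer. The names haskellCode and decide, and the decision rule itself, are made up for the example. unsafePackCStringLen from Data.ByteString.Unsafe wraps the C memory in O(1) without copying it.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafePackCStringLen)
import Foreign.C.Types (CChar, CInt (..))
import Foreign.Ptr (Ptr)

-- Hypothetical decision rule, for illustration only:
-- (sum of the world bytes + personality) mod 256.
decide :: B.ByteString -> Int -> Int
decide world personality =
  (B.foldl' (\acc w -> acc + fromIntegral w) 0 world + personality) `mod` 256

-- Entry point for C: receives a pointer to the shared buffer and its length.
-- unsafePackCStringLen does not copy, so the C side must keep the buffer
-- alive and unmodified for the duration of the call.
haskellCode :: Ptr CChar -> CInt -> CInt -> IO CInt
haskellCode buf len personality = do
  world <- unsafePackCStringLen (buf, fromIntegral len)
  -- Force the result before returning so no thunk refers to the C buffer later.
  return $! fromIntegral (decide world (fromIntegral personality))

foreign export ccall haskellCode :: Ptr CChar -> CInt -> CInt -> IO CInt

-- Stand-in for the C caller: lend a 20 kB buffer to haskellCode a few times.
main :: IO ()
main = do
  let world = B.replicate 20480 1
  B.useAsCStringLen world $ \(buf, len) ->
    mapM_ (\p -> haskellCode buf (fromIntegral len) p >>= print) [0, 1, 2]
```

In a real program the main above would be replaced by the C loop, which would pass the same pointer and length on each of the 1000 calls; nothing is copied per call.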

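I'd like to parallelize this

I would strongly recommend inverting the control then, and doing the parallelization from the Haskell runtime. It will be much more robust, as that path has been heavily tested.

Regarding thread safety, it is apparently safe to make parallel calls to foreign exported functions running in the same runtime, though I am fairly sure no one has tried this in order to gain parallelism. Each call in acquires a capability, which is essentially a lock, so multiple calls may block, reducing your chances for parallelism. In the multicore case (e.g. -N4 or so) your results may differ, since multiple capabilities are available. Even so, this is almost certainly a bad way to improve performance.

Again, making many parallel function calls from Haskell via forkIO is a better documented, better tested path, with less overhead than doing the work on the C side, and probably less code in the end.

Just make one call into your Haskell function, which in turn does the parallelism via many Haskell threads. Easy!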
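To make the "invert the control" suggestion concrete, here is a minimal sketch using only the base library. It assumes personalities and decisions are plain C ints; decideAll, parallelMap, and the squaring rule in decide are hypothetical names and logic chosen for illustration. C makes a single call to decideAll, which forks one Haskell thread per actor with forkIO and collects the results through MVars.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (evaluate)
import Foreign.C.Types (CInt (..))
import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray, withArray)
import Foreign.Ptr (Ptr)

-- Hypothetical per-actor decision; stands in for the real logic.
decide :: CInt -> CInt
decide personality = personality * personality

-- Fork one Haskell thread per element and collect the results in order.
-- evaluate forces each result (to weak head normal form) inside its worker.
-- Sketch only: if f throws, the matching takeMVar never completes.
parallelMap :: (a -> b) -> [a] -> IO [b]
parallelMap f xs = do
  vars <- mapM spawn xs
  mapM takeMVar vars
  where
    spawn x = do
      var <- newEmptyMVar
      _ <- forkIO (evaluate (f x) >>= putMVar var)
      return var

-- Single entry point for C: reads n personalities, writes n decisions.
decideAll :: Ptr CInt -> Ptr CInt -> CInt -> IO ()
decideAll personalities decisions n = do
  ps <- peekArray (fromIntegral n) personalities
  ds <- parallelMap decide ps
  pokeArray decisions ds

foreign export ccall decideAll :: Ptr CInt -> Ptr CInt -> CInt -> IO ()

-- Stand-in for the C caller.
main :: IO ()
main =
  withArray [1 .. 8] $ \inPtr ->
    allocaArray 8 $ \outPtr -> do
      decideAll inPtr outPtr 8
      peekArray 8 outPtr >>= print
```

The threads only run on several cores when the program is linked with -threaded and started with more than one capability (for example +RTS -N4); otherwise they are interleaved on a single core.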

score 9 · Accepted Answer

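I use a mix of C and Haskell threads in one of my applications and haven't noticed much of a performance hit when switching between the two. So I put together a simple benchmark, and it comes out quite a bit faster/cheaper than Don's. This measures 10 million iterations on a 2.66GHz i7: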

$ ./foo
IO  : 2381952795 nanoseconds total, 238.195279 nanoseconds per, 160000000 value
Pure: 2188546976 nanoseconds total, 218.854698 nanoseconds per, 160000000 value

Compiled with GHC 7.0.3/x86_64 and gcc-4.2.1 on OSX 10.6:

ghc -no-hs-main -lstdc++ -O2 -optc-O2 -o foo ForeignExportCost.hs Driver.cpp

Haskell:

{-# LANGUAGE ForeignFunctionInterface #-}

module ForeignExportCost where

import Foreign.C.Types

foreign export ccall simpleFunction :: CInt -> CInt
simpleFunction i = i * i

foreign export ccall simpleFunctionIO :: CInt -> IO CInt
simpleFunctionIO i = return (i * i)

And an OSX C++ app to drive it, which should be simple to adapt to Windows or Linux:

#include <stdio.h>
#include <mach/mach_time.h>
#include <mach/kern_return.h>
#include <HsFFI.h>
#include "ForeignExportCost_stub.h"

static const int s_loop = 10000000;

int main(int argc, char** argv) {
    hs_init(&argc, &argv);

    struct mach_timebase_info timebase_info = { };
    kern_return_t err;
    err = mach_timebase_info(&timebase_info);
    if (err != KERN_SUCCESS) {
        fprintf(stderr, "error: %x\n", err);
        return err;
    }

    // timing a function in IO
    uint64_t start = mach_absolute_time();
    HsInt32 val = 0;
    for (int i = 0; i < s_loop; ++i) {
        val += simpleFunctionIO(4);
    }

    // in nanoseconds per http://developer.apple.com/library/mac/#qa/qa1398/_index.html
    uint64_t duration = (mach_absolute_time() - start) * timebase_info.numer / timebase_info.denom;
    double duration_per = static_cast<double>(duration) / s_loop;
    printf("IO  : %llu nanoseconds total, %f nanoseconds per, %d value\n", (unsigned long long)duration, duration_per, val);

    // run the loop again with a pure function
    start = mach_absolute_time();
    val = 0;
    for (int i = 0; i < s_loop; ++i) {
        val += simpleFunction(4);
    }

    duration = (mach_absolute_time() - start) * timebase_info.numer / timebase_info.denom;
    duration_per = static_cast<double>(duration) / s_loop;
    printf("Pure: %llu nanoseconds total, %f nanoseconds per, %d value\n", (unsigned long long)duration, duration_per, val);

    hs_exit();
}

score 3 · Accepted Answer


Haskell can peek at that 20k blob if you pass a pointer.

answered 2011-04-14T16:57:59.803

score 1 · Accepted Answer

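Disclaimer: I have no experience with the FFI.

But it seems to me that if you want to reuse the 20 Kb of data rather than pass it on every call, you could simply have a function that takes a list of "personalities" and returns a list of "decisions".

So if you have a function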

f :: LotsaData -> Personality -> Decision
f dat p = ...   -- "data" is a reserved word in Haskell, so the argument is named dat

then why not make a helper function

helper :: LotsaData -> [Personality] -> [Decision]
helper dat ps = map (f dat) ps   -- "data" is a reserved word in Haskell, so the argument is named dat

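and call that? With this approach, though, if you want to parallelize, you would need to do it on the Haskell side with parallel lists and a parallel map.

I defer to the experts to explain if/how C arrays can be marshalled easily into Haskell lists (or a similar structure).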
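One way the marshalling could look, as a sketch under stated assumptions rather than a definitive answer: peekArray from Foreign.Marshal.Array copies a C array into a Haskell list, and pokeArray writes a list back into a caller-provided buffer. The concrete types chosen here for LotsaData, Personality, and Decision (all doubles), the rule inside f, and the wrapper name helperC are hypothetical. The argument is named dat because data is a reserved word in Haskell.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble (..), CInt (..))
import Foreign.Marshal.Array (allocaArray, peekArray, pokeArray, withArray)
import Foreign.Ptr (Ptr)

-- Hypothetical stand-ins for the types used in the answer.
type LotsaData   = [Double]
type Personality = Double
type Decision    = Double

-- Hypothetical decision rule: scale the mean of the data by the personality.
f :: LotsaData -> Personality -> Decision
f dat p = p * sum dat / fromIntegral (max 1 (length dat))

helper :: LotsaData -> [Personality] -> [Decision]
helper dat ps = map (f dat) ps

-- C-facing wrapper: peekArray copies each C array into a Haskell list,
-- and pokeArray writes the decisions into the caller's output buffer,
-- which must have room for psLen doubles.
helperC :: Ptr CDouble -> CInt -> Ptr CDouble -> CInt -> Ptr CDouble -> IO ()
helperC datPtr datLen psPtr psLen outPtr = do
  dat <- peekArray (fromIntegral datLen) datPtr
  ps  <- peekArray (fromIntegral psLen) psPtr
  pokeArray outPtr
    (map realToFrac (helper (map realToFrac dat) (map realToFrac ps)))

foreign export ccall helperC
  :: Ptr CDouble -> CInt -> Ptr CDouble -> CInt -> Ptr CDouble -> IO ()

-- Stand-in for the C caller.
main :: IO ()
main =
  withArray [2, 4, 6] $ \datPtr ->
    withArray [1, 0.5] $ \psPtr ->
      allocaArray 2 $ \outPtr -> do
        helperC datPtr 3 psPtr 2 outPtr
        peekArray 2 outPtr >>= print
```

Note that peekArray copies the whole array into a linked list on every call, so for large data sets a structure that wraps the C memory directly (such as the ByteString or Vector mentioned in the first answer) is likely to be cheaper.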
