c - unicode collation NIF running slower than Pure Erlang implementation

Question

I'm trying to optimise existing unicode collation library(written in Erlang) by rewriting it as a NIF implementation. Prime reason is because collation is CPU intensive operation.

Link to implementation: https://github.com/abhi-bit/merger

Unicode collation of 1M rows via Pure Erlang based priority queue:

erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_couch_skew main 1000000 -s init stop
Queue size: 1000000
12321.649 ms

Unicode collation of 1M rows via NIF based binomial heap:

erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_merger main 1000000 -s init stop
Queue size: 1000000
15871.965 ms

This is unusual, I was expecting it to be probably ~10X faster.

I turned on eprof/fprof but they aren't of much use when it comes to NIF modules, below is what eprof said about the prominent functions

FUNCTION                                         CALLS      %     TIME  [uS / CALLS]
--------                                         -----    ---     ----  [----------]
merger:new/0                                         1   0.00        0  [      0.00]
merger:new/2                                         1   0.00        0  [      0.00]
merger:size/1                                   100002   0.31    19928  [      0.20]
merger:in/3                                     100000   3.29   210620  [      2.11]
erlang:put/2                                   2000000   6.63   424292  [      0.21]
merger:out/1                                    100000  14.35   918834  [      9.19]

I'm sure, NIF implementation could be made faster because I've a pure C implementation of unicode collation based on binary Heap using dynamic array and that's much much faster.

$ make
gcc -I/usr/local/Cellar/icu4c/55.1/include  -L/usr/local/Cellar/icu4c/55.1/lib  min_heap.c collate_json.c kway_merge.c kway_merge_test.c -o output -licui18n -licuuc -licudata
./output
Merging 1 arrays each of size 1000000
mergeKArrays took 84.626ms

Specific questions I've here:

How much slowdown is expected because of Erlang <-> C communication in a NIF module? In this case, slowdown is probably 30x or more between pure C and NIF implementation
What tools could be useful to debug NIF related slowdown(like in this case)? I tried using perf top to see the function call, top ones(some hex addresses were showing) were coming from "beam.smp".
What are possible areas that I should look at optimising a NIF? For example: I've heard that one should keep data being transferred between Erlang to C and vice-versa minimal, are there more such areas to consider?

score 3 · Accepted Answer

调用 NIF 的开销很小。当 Erlang 运行时加载一个加载 NIF 的模块时，它会使用仿真器指令修补模块的梁代码以调用 NIF。在调用实现 NIF 的 C 函数之前，指令本身只执行少量设置。这不是导致您的性能问题的区域。

分析 NIF 与分析任何其他 C/C++ 代码非常相似。从您的 Makefile 来看，您似乎是在 OS X 上开发此代码。在该平台上，假设您安装了 XCode，您可以将Instruments 应用程序与 CPU Samples 仪器一起使用，以查看您的代码大部分时间花在哪里。在 Linux 上，你可以使用 valgrind 的callgrind 工具和一个支持 valgrind的Erlang 模拟器来测量你的代码。

例如，如果您在代码上使用这些工具，您会发现perf_merger:main/1大部分时间都merger_nif_heap_get花在CollateJSON. 该函数似乎调用convertUTF8toUChar了createStringFromJSON很多。您的 NIF 似乎也执行了大量的内存分配。这些是您应该关注的领域，以加快您的代码速度。

c - unicode collation NIF running slower than Pure Erlang implementation

1 回答 1

Related

Reference