I'm trying to optimise existing unicode collation library(written in Erlang) by rewriting it as a NIF implementation. Prime reason is because collation is CPU intensive operation.
Link to implementation: https://github.com/abhi-bit/merger
Unicode collation of 1M rows via Pure Erlang based priority queue:
erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_couch_skew main 1000000 -s init stop
Queue size: 1000000
12321.649 ms
Unicode collation of 1M rows via NIF based binomial heap:
erlc *.erl; ERL_LIBS="..:$ERL_LIBS" erl -noshell -s perf_merger main 1000000 -s init stop
Queue size: 1000000
15871.965 ms
This is unusual, I was expecting it to be probably ~10X faster.
I turned on eprof
/fprof
but they aren't of much use when it comes to NIF modules, below is what eprof
said about the prominent functions
FUNCTION CALLS % TIME [uS / CALLS]
-------- ----- --- ---- [----------]
merger:new/0 1 0.00 0 [ 0.00]
merger:new/2 1 0.00 0 [ 0.00]
merger:size/1 100002 0.31 19928 [ 0.20]
merger:in/3 100000 3.29 210620 [ 2.11]
erlang:put/2 2000000 6.63 424292 [ 0.21]
merger:out/1 100000 14.35 918834 [ 9.19]
I'm sure, NIF implementation could be made faster because I've a pure C implementation of unicode collation based on binary Heap using dynamic array and that's much much faster.
$ make
gcc -I/usr/local/Cellar/icu4c/55.1/include -L/usr/local/Cellar/icu4c/55.1/lib min_heap.c collate_json.c kway_merge.c kway_merge_test.c -o output -licui18n -licuuc -licudata
./output
Merging 1 arrays each of size 1000000
mergeKArrays took 84.626ms
Specific questions I've here:
- How much slowdown is expected because of Erlang <-> C communication in a NIF module? In this case, slowdown is probably 30x or more between pure C and NIF implementation
- What tools could be useful to debug NIF related slowdown(like in this case)? I tried using
perf top
to see the function call, top ones(some hex addresses were showing) were coming from "beam.smp". - What are possible areas that I should look at optimising a NIF? For example: I've heard that one should keep data being transferred between Erlang to C and vice-versa minimal, are there more such areas to consider?