concurrency - Wrong process getting killed on another node?

Question

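I wrote a simple program (the "controller") to run some computation on a separate node (the "worker"). The reason is that if the worker node runs out of memory, the controller keeps working: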

-module(controller).
-compile(export_all).

p(Msg,Args) -> io:format("~p " ++ Msg, [time() | Args]).

progress_monitor(P,N) ->
    timer:sleep(5*60*1000),
    p("killing the worker which was using strategy #~p~n", [N]),
    exit(P, took_to_long).

start() ->
    start(1).
start(Strat) ->
    P = spawn('worker@localhost', worker, start, [Strat,self(),60000000000]),
    p("starting worker using strategy #~p~n", [Strat]),
    spawn(controller,progress_monitor,[P,Strat]),
    monitor(process, P),
    receive
        {'DOWN', _, _, P, Info} ->
            p("worker using strategy #~p died. reason: ~p~n", [Strat, Info]);
        X ->
            p("got result: ~p~n", [X])
    end,
    case Strat of
        4 -> p("out of strategies. giving up~n", []);
        _ -> timer:sleep(5000), % wait for node to come back
             start(Strat + 1)
    end.

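To test it, I deliberately wrote three factorial implementations that eat up a lot of memory and crash, plus a fourth that uses tail recursion to avoid taking too much space: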

-module(worker).
-compile(export_all).

start(1,P,N) -> P ! factorial1(N);
start(2,P,N) -> P ! factorial2(N);
start(3,P,N) -> P ! factorial3(N);
start(4,P,N) -> P ! factorial4(N,1).

factorial1(0) -> 1;
factorial1(N) -> N*factorial1(N-1).

factorial2(N) ->
    case N of
        0 -> 1;
        _ -> N*factorial2(N-1)
    end.

factorial3(N) -> lists:foldl(fun(X,Y) -> X*Y end, 1, lists:seq(1,N)).

factorial4(0, A) -> A;
factorial4(N, A) -> factorial4(N-1, A*N).

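Note that every version, including the tail-recursive factorial4, is called with 60000000000, so even factorial4 runs for a long time. Here is the output of running the controller: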

$ erl -sname 'controller@localhost'
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(controller@localhost)1> c(worker).
{ok,worker}
(controller@localhost)2> c(controller).
{ok,controller}
(controller@localhost)3> controller:start().
{23,24,28} starting worker using strategy #1
{23,25,13} worker using strategy #1 died. reason: noconnection
{23,25,18} starting worker using strategy #2
{23,26,2} worker using strategy #2 died. reason: noconnection
{23,26,7} starting worker using strategy #3
{23,26,40} worker using strategy #3 died. reason: noconnection
{23,26,45} starting worker using strategy #4
{23,29,28} killing the worker which was using strategy #1
{23,29,29} worker using strategy #4 died. reason: took_to_long
{23,29,29} out of strategies. giving up
ok

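It almost works, but worker #4 was killed too early (it should have happened close to 23:31:45, not 23:29:29). Looking closer, there was only an attempt to kill worker #1, and none of the others. So worker #4 should not have died, yet it did. Why? We can even see that the reason is took_to_long, and that progress_monitor #1 started at 23:24:28, five minutes before 23:29:29. So it looks like progress_monitor #1 killed worker #4 instead of worker #1. Why did it kill the wrong process?

Here is the output of the worker while I ran the controller: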

$ while true; do erl -sname 'worker@localhost'; done
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1> 
Crash dump was written to: erl_crash.dump
eheap_alloc: Cannot allocate 2733560184 bytes of memory (of type "old_heap").
Aborted
Erlang R16B (erts-5.10.1) [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V5.10.1  (abort with ^G)
(worker@localhost)1>

score 2 · Accepted Answer

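There are several issues, and in the end you ran into creation number wrap-around.

Since you never cancel the progress_monitor process, it always sends its exit signal after 5 minutes.

The computation is long and/or the VM is slow, so process 4 is still running 5 minutes after the progress monitor for process 1 was started.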
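One way to avoid a stale timer is to fold the timeout into the receive itself, so that no separate process outlives the worker it was meant to watch. The following is only a minimal sketch, not the original controller: the module name timeout_sketch, the function await/2, and the atom took_too_long are all made up for illustration.

```erlang
-module(timeout_sketch).
-export([await/2]).

%% Wait for a message from the worker, or for its 'DOWN' notification,
%% giving up after TimeoutMs milliseconds. Because the timeout lives in
%% this receive, nothing is left behind to fire after the worker is gone.
await(Pid, TimeoutMs) ->
    Ref = erlang:monitor(process, Pid),
    receive
        {'DOWN', Ref, process, Pid, Reason} ->
            {died, Reason};
        Result ->
            %% Like the original controller, treat any other message as
            %% the result. Drop the monitor and any pending 'DOWN'.
            erlang:demonitor(Ref, [flush]),
            {ok, Result}
    after TimeoutMs ->
        erlang:demonitor(Ref, [flush]),
        exit(Pid, took_too_long),
        timeout
    end.
```

With this shape, start/1 could call await(P, 5*60*1000) in place of spawning progress_monitor, so the exit signal is only ever sent to the worker that is currently being waited on.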

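The 4 worker nodes are started one after another with the same name, worker@localhost, and the first and fourth nodes end up with the same creation number.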

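Creation numbers (the creation field in references and pids) are a mechanism that prevents pids and references created by a crashed node from being interpreted by a new node with the same name. This is exactly what you expect in your code when you try to kill worker 1 long after its node is gone: you do not intend to kill a process in a restarted node.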

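When a node sends a pid or a reference, it encodes its creation number. When it receives a pid or a reference from another node, it checks that the creation number in the pid matches its own. Creation numbers are assigned by epmd, following the sequence 1, 2, 3.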

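Unfortunately, here, when the 4th node receives the exit signal, the creation number matches because this sequence has wrapped around. Since each node spawned the process after doing exactly the same things as the one before it (initializing Erlang), the pid of the worker on node 4 matches the pid of the worker on node 1.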
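The wrap-around can be illustrated with a toy model. This is a sketch of the 1, 2, 3 cycle described in this answer, not epmd's actual code; the module creation_sketch and its functions are hypothetical.

```erlang
-module(creation_sketch).
-export([creations/1]).

%% Assumed model: the creation number cycles 1, 2, 3, 1, 2, 3, ...
next_creation(C) -> (C rem 3) + 1.

%% Creation numbers handed to the first N incarnations of a node that
%% keeps restarting under the same name.
creations(N) -> creations(N, 1, []).

creations(0, _C, Acc) -> lists:reverse(Acc);
creations(N, C, Acc) -> creations(N - 1, next_creation(C), [C | Acc]).
```

Calling creation_sketch:creations(4) returns [1,2,3,1]: the fourth incarnation reuses the creation number of the first, which is why a pid from incarnation 1 is accepted by incarnation 4.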

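As a result, the controller ends up killing worker 4, believing it to be worker 1.

To avoid this, you need something more robust than the creation number if 4 workers can exist within the lifespan of a pid or a reference held by the controller.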
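One possible way to get a stronger identity, sketched here under clear assumptions, is to record the OS process id of the worker VM when the worker is spawned and to check it again before sending the exit signal. The module incarnation_sketch and both functions are hypothetical. OS pids can also be reused, and the node could restart between the check and the exit, so this narrows the window rather than closing it.

```erlang
-module(incarnation_sketch).
-export([incarnation/1, kill_if_same/3]).

%% Identify the running instance of Node by the OS pid of its VM.
%% Returns a string such as "12345", or {badrpc, Reason} if the node
%% cannot be reached.
incarnation(Node) ->
    rpc:call(Node, os, getpid, []).

%% Send the exit signal only if the node hosting Pid is still the
%% incarnation that was recorded when the worker was spawned.
kill_if_same(Pid, Incarnation, Reason) ->
    case incarnation(node(Pid)) of
        Incarnation ->
            exit(Pid, Reason),
            killed;
        _Other ->
            %% The node was restarted or is unreachable: do not signal.
            skipped
    end.
```

The controller would call incarnation('worker@localhost') right after spawning the worker and pass the result along to whatever later decides to kill it.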
