I have a university project in which we are supposed to build a cluster out of RPis. We now have a fully functional system running BLCR/MPICH. BLCR works very well with normal processes that are linked against the library. The demos we have to show from our management web interface are:
- parallel job execution
- process migration across nodes
- MPI fault tolerance
We can use the simplest possible computations. The first point was easy to get working, also with MPI. For the second point we actually only use normal processes (without MPI). Regarding the third point, I don't really know how to implement a master/slave MPI scheme in which I can restart a slave process; this also affects the second point, because we should/could/have to checkpoint a slave process, kill/stop it and restart it on another node. I know that I have to handle MPI errors myself, but how do I restore the process? It would be great if someone could at least point me to a link or a paper (with an explanation).
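What I have in mind for the master/slave part is roughly the following minimal sketch (my own assumption, not working fault tolerance: MPI_ERRORS_RETURN only turns an abort into an error code that the master can see, the actual recovery of a dead slave would still have to come from BLCR / the process manager):

/* master/slave sketch: the master collects one result per slave and
 * notices (via the error code) when a slave has died */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* return error codes instead of aborting the whole job */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    if (rank == 0) {                          /* master */
        int i, result;
        for (i = 1; i < size; i++) {
            int err = MPI_Recv(&result, 1, MPI_INT, i, 0,
                               MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (err != MPI_SUCCESS)
                /* slave i is gone -- this is where it would have to be
                 * restored from its checkpoint and asked again */
                fprintf(stderr, "lost slave %d (error %d)\n", i, err);
            else
                printf("slave %d returned %d\n", i, result);
        }
    } else {                                  /* slave */
        int result = rank * rank;             /* stand-in for real work */
        MPI_Send(&result, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}

Is something along these lines even the right direction, or does the recovery have to happen completely outside of MPI?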
Thanks in advance
Update: As mentioned above, our BLCR+MPICH setup works, or seems to work. But... when I start MPI processes, checkpointing seems to run fine.
Here is the proof:
... snip ...
Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... DONE
Many salts: 767744 c/s real, 767744 c/s virtual
Only one salt: 560896 c/s real, 560896 c/s virtual
Benchmarking: dynamic_5: md5($s.$p.$s) [32/32 128x1 (MD5_Body)]... [proxy:0:0@node2] requesting checkpoint
[proxy:0:0@node2] checkpoint completed
[proxy:0:1@node1] requesting checkpoint
[proxy:0:1@node1] checkpoint completed
[proxy:0:2@node3] requesting checkpoint
[proxy:0:2@node3] checkpoint completed
... snip ...
If I kill a slave process on any node, I get this:
... snip ...
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 9
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
... snip ...
That's not a problem, because we have a checkpoint, so we can restart our application. But it doesn't work:
pi 7380 0.0 0.2 2984 1012 pts/4 S+ 16:38 0:00 mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3
pi 7381 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.101 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 0
pi 7382 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.102 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 1
pi 7383 0.1 0.5 5712 2464 ? Ss 16:38 0:00 /usr/bin/ssh -x 192.168.42.105 "/usr/local/bin/mpich/bin/hydra_pmi_proxy" --control-port masterpi:47698 --rmk user --launcher ssh --demux poll --pgid 0 --retries 10 --usize -2 --proxy-id 2
pi 7438 0.0 0.1 3548 868 pts/1 S+ 16:40 0:00 grep --color=auto mpi
I don't know why, but the first time I try the restart, the processes seem to come back up on every node (I can see them with top or ps aux | grep "john"), yet no output appears on the management console/terminal. It just hangs after showing me:
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp -ckpoint-num 0 -f /tmp/machinefile -n 3
Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts.
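One thing I still want to rule out (just a guess on my side): -ckpoint-prefix /tmp means every proxy writes its checkpoint files to the local /tmp of its own node, so before blaming the restart itself I want to check on each node whether those files actually exist, e.g.:

ssh 192.168.42.101 ls -l /tmp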
My plan B is simply to test with an application of our own whether the BLCR/MPICH stuff really works. Maybe john is causing some trouble of its own.
Thanks in advance
**Update**
The next problem, this time with a simple hello world. I'm slowly losing hope. Maybe I'm just too confused.
mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 3 -f /tmp/machinefile -n 4 ./hello
Warning: Permanently added '192.168.42.102' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.105' (ECDSA) to the list of known hosts.
Warning: Permanently added '192.168.42.101' (ECDSA) to the list of known hosts.
[proxy:0:0@node2] requesting checkpoint
[proxy:0:0@node2] checkpoint completed
[proxy:0:1@node1] requesting checkpoint
[proxy:0:1@node1] checkpoint completed
[proxy:0:2@node3] requesting checkpoint
[proxy:0:2@node3] checkpoint completed
[proxy:0:0@node2] requesting checkpoint
[proxy:0:0@node2] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:0@node2] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:0@node2] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@node2] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:1@node1] requesting checkpoint
[proxy:0:1@node1] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:1@node1] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:1@node1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@node1] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[proxy:0:2@node3] requesting checkpoint
[proxy:0:2@node3] HYDT_ckpoint_checkpoint (./tools/ckpoint/ckpoint.c:111): Previous checkpoint has not completed.[proxy:0:2@node3] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:905): checkpoint suspend failed
[proxy:0:2@node3] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:2@node3] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@masterpi] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@masterpi] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@masterpi] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@masterpi] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
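The only explanation I have (pure assumption on my side) is that -ckpoint-interval 3 is simply too short on the RPis, so the next periodic checkpoint is requested before the previous BLCR checkpoint has finished, which would match the "Previous checkpoint has not completed" messages. As a sanity check I want to repeat the run with a much longer interval, e.g.:

mpiexec -ckpointlib blcr -ckpoint-prefix /tmp/ -ckpoint-interval 60 -f /tmp/machinefile -n 4 ./hello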
hello.c:
/* C Example */
#include <stdio.h>
#include <unistd.h>  /* gethostname(), getpid() */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i, j;
    char hostname[1024];

    hostname[1023] = '\0';
    gethostname(hostname, 1023);

    MPI_Init(&argc, &argv);                 /* starts MPI */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* get current process id */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* get number of processes */

    /* busy loop so the job runs long enough to be checkpointed;
       j is initialised so the inner loop is well-defined */
    j = 0;
    for (i = 0; i < 400000000; i++) {
        for ( ; j < 4000000; j++) {
        }
    }

    printf("%s done...", hostname);
    printf("%s: %d is alive\n", hostname, (int) getpid());

    MPI_Finalize();
    return 0;
}
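For completeness, I build it with the MPICH compiler wrapper before starting it with the mpiexec line from above:

mpicc -o hello hello.c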