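I am trying to set up a small IPython cluster over SSH in a private network (this all used to work with IPython 0.10.0 [sic!]): 4 nodes, alice, bob, carol and dan, each with 4 CPU cores. The controller runs on carol, and all PCs run Ubuntu 14.10 with IPython 2.3.0. ~/.ipython/profile_default is shared between all PCs via NFS. For internal reasons I cannot use MPI.

Now, when the cluster is started, I only see 4 engines instead of the expected 16. I have already increased SSHEngineSetLauncher.delay, but that did not help.

I tried to track this down and ended up using only carol (the controller host), starting four engines on it locally via SSH, but in that setup only one engine is actually running.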

My ipcluster_config.py looks like this:

c = get_config()
c.IPClusterStart.engine_launcher_class = 'SSHEngineSetLauncher'
c.SSHEngineSetLauncher.delay = 10
c.SSHEngineSetLauncher.engines = { 'carol' : 4}#, 'dan' : 4, 'alice' : 4, 'bob' : 4 }

engines.json:

{
    "next_id": 4,
    "engines": {
        "0": "80d135a7-b8f6-435c-930a-0cde15a6feb2",
        "1": "b69916c3-87c2-4e09-9284-aefe665ba616",
        "2": "f3df3951-5e0b-4694-aa67-7ae66a181551",
        "3": "4311705d-03d4-4e48-a7a9-7be47467c439"}}

For reference, here are the log files.

=> ipcontroller.log

2015-05-21 07:28:24.442 [IPControllerApp] Hub listening on tcp://127.0.0.1:57360 for registration.
2015-05-21 07:28:24.443 [IPControllerApp] Hub using DB backend: 'NoDB'
2015-05-21 07:28:24.695 [IPControllerApp] hub::created hub
2015-05-21 07:28:24.695 [IPControllerApp] writing connection info to /home/lst3si/.ipython/profile_default/security/ipcontroller-client.json
2015-05-21 07:28:24.695 [IPControllerApp] writing connection info to /home/lst3si/.ipython/profile_default/security/ipcontroller-engine.json
2015-05-21 07:28:24.696 [IPControllerApp] task::using Python leastload Task scheduler
2015-05-21 07:28:24.696 [IPControllerApp] Heartmonitor started
2015-05-21 07:28:24.700 [IPControllerApp] Creating pid file: /home/lst3si/.ipython/profile_default/pid/ipcontroller.pid
2015-05-21 07:28:24.707 [IPControllerApp] client::client '\x00\x91y`\x0c' requested u'connection_request'
2015-05-21 07:28:24.707 [IPControllerApp] client::client ['\x00\x91y`\x0c'] connected
2015-05-21 07:28:26.071 [IPControllerApp] client::client '80d135a7-b8f6-435c-930a-0cde15a6feb2' requested u'registration_request'
2015-05-21 07:28:26.103 [IPControllerApp] WARNING | iopub::IOPub message lacks parent: {'parent_header': {}, 'msg_type': u'status', 'msg_id': u'230d5aa1-c395-4b82-a964-a3062e5550a9', 'content': {u'execution_state': u'starting'}, 'header': {u'date': datetime.datetime(2015, 5, 21, 7, 28, 26, 102954), u'username': u'lst3si', u'session': u'80d135a7-b8f6-435c-930a-0cde15a6feb2', u'msg_id': u'230d5aa1-c395-4b82-a964-a3062e5550a9', u'msg_type': u'status'}, 'buffers': [], 'metadata': {}}
2015-05-21 07:28:30.699 [IPControllerApp] registration::finished registering engine 0:80d135a7-b8f6-435c-930a-0cde15a6feb2
2015-05-21 07:28:30.699 [IPControllerApp] engine::Engine Connected: 0
2015-05-21 07:28:36.071 [IPControllerApp] client::client 'b69916c3-87c2-4e09-9284-aefe665ba616' requested u'registration_request'
2015-05-21 07:28:36.102 [IPControllerApp] WARNING | iopub::IOPub message lacks parent: {'parent_header': {}, 'msg_type': u'status', 'msg_id': u'f74a1f38-f3fb-422f-b4ad-0d1724745c64', 'content': {u'execution_state': u'starting'}, 'header': {u'date': datetime.datetime(2015, 5, 21, 7, 28, 36, 102052), u'username': u'lst3si', u'session': u'b69916c3-87c2-4e09-9284-aefe665ba616', u'msg_id': u'f74a1f38-f3fb-422f-b4ad-0d1724745c64', u'msg_type': u'status'}, 'buffers': [], 'metadata': {}}
2015-05-21 07:28:36.285 [IPControllerApp] client::client '\x00\x91y`\r' requested u'connection_request'
2015-05-21 07:28:36.285 [IPControllerApp] client::client ['\x00\x91y`\r'] connected
2015-05-21 07:28:39.699 [IPControllerApp] registration::finished registering engine 1:b69916c3-87c2-4e09-9284-aefe665ba616
2015-05-21 07:28:39.699 [IPControllerApp] engine::Engine Connected: 1
2015-05-21 07:28:46.143 [IPControllerApp] client::client 'f3df3951-5e0b-4694-aa67-7ae66a181551' requested u'registration_request'
2015-05-21 07:28:46.175 [IPControllerApp] WARNING | iopub::IOPub message lacks parent: {'parent_header': {}, 'msg_type': u'status', 'msg_id': u'a3aa09af-6958-4362-a1f4-5df01da8941b', 'content': {u'execution_state': u'starting'}, 'header': {u'date': datetime.datetime(2015, 5, 21, 7, 28, 46, 174675), u'username': u'lst3si', u'session': u'f3df3951-5e0b-4694-aa67-7ae66a181551', u'msg_id': u'a3aa09af-6958-4362-a1f4-5df01da8941b', u'msg_type': u'status'}, 'buffers': [], 'metadata': {}}
2015-05-21 07:28:51.699 [IPControllerApp] registration::finished registering engine 2:f3df3951-5e0b-4694-aa67-7ae66a181551
2015-05-21 07:28:51.700 [IPControllerApp] engine::Engine Connected: 2
2015-05-21 07:28:56.113 [IPControllerApp] client::client '4311705d-03d4-4e48-a7a9-7be47467c439' requested u'registration_request'
2015-05-21 07:28:56.145 [IPControllerApp] WARNING | iopub::IOPub message lacks parent: {'parent_header': {}, 'msg_type': u'status', 'msg_id': u'671288cf-32ea-4a41-8e17-9be4ba1216dd', 'content': {u'execution_state': u'starting'}, 'header': {u'date': datetime.datetime(2015, 5, 21, 7, 28, 56, 144586), u'username': u'lst3si', u'session': u'4311705d-03d4-4e48-a7a9-7be47467c439', u'msg_id': u'671288cf-32ea-4a41-8e17-9be4ba1216dd', u'msg_type': u'status'}, 'buffers': [], 'metadata': {}}
2015-05-21 07:29:00.698 [IPControllerApp] registration::finished registering engine 3:4311705d-03d4-4e48-a7a9-7be47467c439
2015-05-21 07:29:00.700 [IPControllerApp] engine::Engine Connected: 3

=> ipengine.log (they all look the same; only the "Completed registration with id x" line differs, with x going from 0 to 3 across the engines):

2015-05-21 07:28:26.065 [IPEngineApp] Loading url_file u'.ipython/profile_default/security/ipcontroller-engine.json'
2015-05-21 07:28:26.070 [IPEngineApp] Registering with controller at tcp://127.0.0.1:57360
2015-05-21 07:28:26.101 [IPEngineApp] Starting to monitor the heartbeat signal from the hub every 3010 ms.
2015-05-21 07:28:26.102 [IPEngineApp] Using existing profile dir: u'.ipython/profile_default'
2015-05-21 07:28:26.103 [IPEngineApp] Completed registration with id 0

1 Answer

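I solved the problem myself. The engines were not started because of a bug in IPython.utils.localinterfaces.public_ips (which I have reported): it ignores the fact that ifconfig output is localized, so it returned 'Adresse:127.0.0.1' (I changed the IP value) instead of a plain IP address.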
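To illustrate the kind of failure described here, the following is a minimal sketch and not IPython's actual implementation. The function names and the sample ifconfig lines are assumptions made up for the example. It contrasts a parser that only knows the English `addr:` prefix with one that extracts the IPv4 address regardless of the label language.

```python
import re

# Illustrative ifconfig lines (assumed samples, English and German locale).
SAMPLE_EN = "inet addr:192.168.1.10  Bcast:192.168.1.255  Mask:255.255.255.0"
SAMPLE_DE = "inet Adresse:192.168.1.10  Bcast:192.168.1.255  Maske:255.255.255.0"

IPV4 = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")


def naive_inet_address(line):
    """Hypothetical parser that only strips the English 'addr:' prefix."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "inet":
        return None
    token = fields[1]
    if token.startswith("addr:"):
        return token[len("addr:"):]
    # Any other label (e.g. 'Adresse:') leaks into the result.
    return token


def robust_inet_address(line):
    """Hypothetical parser that ignores the label and matches the IPv4 itself."""
    fields = line.split()
    if len(fields) < 2 or fields[0] != "inet":
        return None
    match = IPV4.search(fields[1])
    return match.group(1) if match else None


if __name__ == "__main__":
    for sample in (SAMPLE_EN, SAMPLE_DE):
        print(naive_inet_address(sample), robust_inet_address(sample))
```

Running the tools with an English locale (for example `LC_ALL=C`) is another common way to sidestep localized output, though whether that is applicable depends on how the command is invoked.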

As a workaround, I now use the following ipcluster_config.py (note the --location option in controller_args):

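c = get_config()
# Launch the engines over SSH.
c.IPClusterEngines.engine_launcher_class = 'SSH'
# Advertise a reachable address instead of 127.0.0.1 and listen on all interfaces.
c.LocalControllerLauncher.controller_args = ['--location=<engine_ip1>', '--ip=*']
# Host -> number of engines; replace the placeholders with real addresses.
c.SSHEngineSetLauncher.engines = {'<engine_ip1>': 4, '<engine_ip>': 4}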

In this example, the controller runs locally on <engine_ip1>.
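One way to check whether the workaround took effect is to look at the connection file the controller writes (ipcontroller-engine.json in the logs above) and confirm that the advertised address is no longer a loopback address. The sketch below assumes the file is JSON with a "location" key; that key name and the helper function are assumptions for illustration and should be checked against the file on your system.

```python
import ipaddress
import json


def advertises_loopback(connection_info):
    """Return True if the (assumed) 'location' entry is a loopback IP address."""
    location = connection_info.get("location", "")
    try:
        return ipaddress.ip_address(location).is_loopback
    except ValueError:
        # Empty value or a hostname rather than a literal IP address.
        return False


if __name__ == "__main__":
    # Hypothetical file contents; real files contain more keys (ports, key, ...).
    sample = json.loads('{"location": "127.0.0.1", "registration": 57360}')
    print(advertises_loopback(sample))
```

If this reports a loopback address, engines on other hosts would try to register with their own localhost rather than with the controller.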

Answered on 2015-05-22T05:24:33.570