I am trying to install Slurm on a small two-machine system, but starting slurmd fails with the following error:
Job for slurmd.service failed because the control process exited with error code.
See "systemctl status slurmd.service" and "journalctl -xe" for details.
The output of systemctl status slurmd.service and journalctl -xe is as follows:
● slurmd.service - Slurm node daemon
Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Fri 2020-12-04 13:18:51 CST; 4min 50s ago
Docs: man:slurmd(8)
Process: 26501 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=1/FAILURE)
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: Starting Slurm node daemon...
12月 04 13:18:51 Y-Cluster-Node1 slurmd[26501]: fatal: Unable to determine this slurmd's NodeName
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: slurmd.service: Control process exited, code=exited status=1
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: slurmd.service: Failed with result 'exit-code'.
12月 04 13:18:51 Y-Cluster-Node1 systemd[1]: Failed to start Slurm node daemon.
12月 04 13:21:05 Y-Cluster-Node1 sshd[26624]: Disconnected from authenticating user root 150.158.213.234 port 54962 [preauth]
12月 04 13:21:23 Y-Cluster-Node1 sshd[26632]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=115.68.207.186 user=root
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Failed password for root from 115.68.207.186 port 58882 ssh2
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Received disconnect from 115.68.207.186 port 58882:11: Bye Bye [preauth]
12月 04 13:21:25 Y-Cluster-Node1 sshd[26632]: Disconnected from authenticating user root 115.68.207.186 port 58882 [preauth]
12月 04 13:21:25 Y-Cluster-Node1 sshd[26630]: Connection closed by 212.64.12.236 port 46106 [preauth]
12月 04 13:22:13 Y-Cluster-Node1 sshd[26635]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=118.25.24.84 user=root
12月 04 13:22:14 Y-Cluster-Node1 sshd[26637]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=111.125.70.22 user=root
12月 04 13:22:14 Y-Cluster-Node1 sshd[26635]: Failed password for root from 118.25.24.84 port 47018 ssh2
12月 04 13:22:15 Y-Cluster-Node1 sshd[26635]: Received disconnect from 118.25.24.84 port 47018:11: Bye Bye [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26635]: Disconnected from authenticating user root 118.25.24.84 port 47018 [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Failed password for root from 111.125.70.22 port 58216 ssh2
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Received disconnect from 111.125.70.22 port 58216:11: Bye Bye [preauth]
12月 04 13:22:15 Y-Cluster-Node1 sshd[26637]: Disconnected from authenticating user root 111.125.70.22 port 58216 [preauth]
12月 04 13:22:16 Y-Cluster-Node1 sshd[26639]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=72.167.227.34 user=root
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Failed password for root from 72.167.227.34 port 56304 ssh2
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Received disconnect from 72.167.227.34 port 56304:11: Bye Bye [preauth]
12月 04 13:22:18 Y-Cluster-Node1 sshd[26639]: Disconnected from authenticating user root 72.167.227.34 port 56304 [preauth]
12月 04 13:22:32 Y-Cluster-Node1 sshd[26641]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=182.138.239.224 user=root
12月 04 13:22:34 Y-Cluster-Node1 sshd[26641]: Failed password for root from 182.138.239.224 port 48870 ssh2
12月 04 13:22:36 Y-Cluster-Node1 sshd[26641]: Received disconnect from 182.138.239.224 port 48870:11: Bye Bye [preauth]
12月 04 13:22:36 Y-Cluster-Node1 sshd[26641]: Disconnected from authenticating user root 182.138.239.224 port 48870 [preauth]
12月 04 13:22:56 Y-Cluster-Node1 sshd[26648]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=81.68.123.185 user=root
12月 04 13:22:58 Y-Cluster-Node1 sshd[26648]: Failed password for root from 81.68.123.185 port 60848 ssh2
12月 04 13:23:00 Y-Cluster-Node1 sshd[26648]: Received disconnect from 81.68.123.185 port 60848:11: Bye Bye [preauth]
12月 04 13:23:00 Y-Cluster-Node1 sshd[26648]: Disconnected from authenticating user root 81.68.123.185 port 60848 [preauth]
12月 04 13:23:02 Y-Cluster-Node1 sshd[26652]: Connection closed by 139.217.221.89 port 35808 [preauth]
12月 04 13:23:13 Y-Cluster-Node1 sshd[26654]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=159.65.1.41 user=root
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Failed password for root from 159.65.1.41 port 40538 ssh2
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Received disconnect from 159.65.1.41 port 40538:11: Bye Bye [preauth]
12月 04 13:23:16 Y-Cluster-Node1 sshd[26654]: Disconnected from authenticating user root 159.65.1.41 port 40538 [preauth]
12月 04 13:23:43 Y-Cluster-Node1 sshd[26656]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=222.222.31.70 user=root
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Failed password for root from 222.222.31.70 port 35282 ssh2
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Received disconnect from 222.222.31.70 port 35282:11: Bye Bye [preauth]
12月 04 13:23:46 Y-Cluster-Node1 sshd[26656]: Disconnected from authenticating user root 222.222.31.70 port 35282 [preauth]
12月 04 13:24:02 Y-Cluster-Node1 sshd[26660]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=150.158.213.234 user=root
12月 04 13:24:04 Y-Cluster-Node1 sshd[26660]: Failed password for root from 150.158.213.234 port 36350 ssh2
12月 04 13:24:05 Y-Cluster-Node1 sshd[26660]: Received disconnect from 150.158.213.234 port 36350:11: Bye Bye [preauth]
12月 04 13:24:05 Y-Cluster-Node1 sshd[26660]: Disconnected from authenticating user root 150.158.213.234 port 36350 [preauth]
Trying to make sense of this, it looks to me like a connection problem in which the control node (node1) cannot reach the compute node (node2).
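That said, the fatal line itself mentions NodeName rather than a connection, and slurmd works out its own NodeName by matching the local hostname against the node definitions in slurm.conf. A minimal sketch for ruling that out follows. The default path /etc/slurm/slurm.conf and the helper names conf_node_names and check_node_name are assumptions for illustration, and the match is literal, so range expressions such as node[1-2] are not expanded.

```shell
#!/bin/sh
# List the NodeName values declared in a slurm.conf-style file.
# The default path is an assumption; packages install slurm.conf in different places.
conf_node_names() {
    conf="${1:-/etc/slurm/slurm.conf}"
    # Only lines that begin with NodeName= are used, so commented-out lines are skipped.
    sed -n 's/^[[:space:]]*NodeName=\([^[:space:]]*\).*/\1/p' "$conf"
}

# Report whether this host's short hostname appears literally among those values.
check_node_name() {
    host="$(hostname -s)"
    if conf_node_names "$1" | grep -qx "$host"; then
        echo "hostname $host matches a NodeName entry"
    else
        echo "hostname $host not found as a literal NodeName entry"
    fi
}
```

Calling check_node_name with the path of the actual slurm.conf on each node shows whether the short hostname is listed. Running slurmd -C prints the node line that slurmd detects for the local machine, which can be compared against the file as well.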
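After some searching, I found a suggestion that this can be caused by mismatched UIDs and GIDs. As the installation guide says, "Make sure the clocks, users and groups (UIDs and GIDs) are synchronized across the cluster." I have not found any UID/GID problem myself. Is there a way to check this? Could someone take a look?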
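For the UID/GID question, one simple check is to print the numeric IDs of the relevant accounts on each node and compare the output by eye. This is only a sketch: the helper name print_ids is hypothetical, and the account names slurm and munge are assumptions that should be replaced with whatever SlurmUser and MUNGE account the installation actually uses.

```shell
#!/bin/sh
# Print "name uid=N gid=N" for each account given, or "name missing"
# when the account does not exist on this host.
print_ids() {
    for user in "$@"; do
        if id "$user" >/dev/null 2>&1; then
            printf '%s uid=%s gid=%s\n' "$user" "$(id -u "$user")" "$(id -g "$user")"
        else
            printf '%s missing\n' "$user"
        fi
    done
}

# Account names here are assumptions; y-cluster is the login user shown in this post.
print_ids slurm munge y-cluster
```

Running the same script on Y-Cluster-Node1 and Y-Cluster-Node2 and comparing the two outputs line by line shows any account whose UID or GID differs between the nodes.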
Some additional information: running "munge -n | unmunge" gives the following on both nodes:
y-cluster@Y-Cluster-Node1:~$ munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: Y-Cluster-Node1 (192.168.1.111)
ENCODE_TIME: 2020-12-04 15:00:18 +0800 (1607065218)
DECODE_TIME: 2020-12-04 15:00:18 +0800 (1607065218)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: y-cluster (1000)
GID: y-cluster (1000)
LENGTH: 0
y-cluster@Y-Cluster-Node2:~/.ssh$ munge -n | unmunge
STATUS: Success (0)
ENCODE_HOST: Y-Cluster-Node2 (192.168.1.112)
ENCODE_TIME: 2020-12-04 15:00:20 +0800 (1607065220)
DECODE_TIME: 2020-12-04 15:00:20 +0800 (1607065220)
TTL: 300
CIPHER: aes128 (4)
MAC: sha256 (5)
ZIP: none (0)
UID: y-cluster (1000)
GID: y-cluster (1000)
LENGTH: 0
Both look fine, with the same UID, GID, and time. From "slurmctld -Dcvvv" I get the following error. Could it be related to the ownership of some log files?
y-cluster@Y-Cluster-Node1:~$ slurmctld -Dcvvv
slurmctld: debug: Log file re-opened
slurmctld: killing old slurmctld[4787]