
I am trying to submit my Condor job, but it keeps failing with this error:

ERROR: Can't find address of local schedd

I am a beginner Condor user, and I'm not sure what this means.

Also, when I run condor_q, I get the following error message:

Error: Can't find address for schedd (name)

Extra Info: You probably saw this error because the condor_schedd is not  running on the machine you are trying to query. If the condor_schedd is not  running, the Condor system will not be able to find an address and port to  connect to and satisfy this request. Please make sure the Condor daemons are  running and try again.

  Extra Info: If the condor_schedd is running on the machine you are trying to  query and you still see the error, the most likely cause is that you have  setup a personal Condor, you have not defined SCHEDD_NAME in your  condor_config file, and something is wrong with your SCHEDD_ADDRESS_FILE  setting. You must define either or both of those settings in your config  file, or you must use the -name option to condor_q. Please see the Condor  manual for details on SCHEDD_NAME and SCHEDD_ADDRESS_FILE.

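Interestingly, condor_status works just fine (I can see the status of all the clusters).

I did some research, and it says I need to use a public directory to get access. Is there a specific directory for Condor submit/queue?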


3 Answers


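Check whether the Condor scheduler daemon (condor_schedd) is running. You can use $ ps aux | grep condor to see all the condor* processes on your machine.

If the schedd is not running, you need to add it to the daemon list in the configuration of the central manager machine (DAEMON_LIST, the line that lists MASTER, STARTD, NEGOTIATOR, and so on). More generally, the schedd has to run on whichever machine you submit from.

By the way, condor_status works fine because it talks to the COLLECTOR daemon, which is running.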
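
The check described above can be sketched in shell. This is a minimal sketch, not a definitive procedure: `daemon_list_has_schedd` and `current_list` are hypothetical names introduced here for illustration, and the only Condor command used is `condor_config_val`, which prints the value of a configuration variable.

```shell
# Hypothetical helper: return 0 if a DAEMON_LIST value
# (comma- or space-separated) contains SCHEDD.
daemon_list_has_schedd() {
    printf '%s\n' "$1" | tr ', ' '\n\n' | grep -qix 'SCHEDD'
}

# Only query the configuration when the Condor tools are installed.
if command -v condor_config_val >/dev/null 2>&1; then
    current_list=$(condor_config_val DAEMON_LIST)
    if daemon_list_has_schedd "$current_list"; then
        echo "SCHEDD is in DAEMON_LIST: $current_list"
    else
        echo "SCHEDD is missing from DAEMON_LIST: $current_list"
    fi
fi
```

If SCHEDD turns out to be missing, one common approach (an assumption about your setup, so check it against your own configuration files) is to add a line such as DAEMON_LIST = $(DAEMON_LIST) SCHEDD to the local configuration and then restart Condor.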

answered 2015-06-09T19:50:11.657

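In my case, the cause was that you cannot submit batch jobs from inside an interactive job. Make sure you are on the head node.

My head node: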

(automl-meta-learning) miranda9~/automl-meta-learning $ hostname
vision-sched.cs.illinois.edu

A compute node:

(automl-meta-learning) miranda9~/automl-meta-learning $ hostname
vision-19.cs.illinois.edu
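
Besides comparing hostnames, you can check whether the machine you are on runs a condor_schedd at all. The following is a rough sketch that assumes a Linux `ps` (procps); `listing_has_schedd` is a hypothetical helper name, not part of Condor.

```shell
# Hypothetical helper: return 0 if a process listing
# (one command name per line, as printed by `ps -eo comm`)
# contains condor_schedd.
listing_has_schedd() {
    printf '%s\n' "$1" | grep -qx 'condor_schedd'
}

if listing_has_schedd "$(ps -eo comm 2>/dev/null)"; then
    echo "$(hostname): condor_schedd is running here, so this is probably a submit node"
else
    echo "$(hostname): no condor_schedd here; try submitting from the head node"
fi
```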
answered 2020-10-22T14:51:13.080

This may be related to a permissions error. I ran into the same error, and the following steps fixed it for me.

sudo mkdir -p /var/run/condor  # If it does not exist
sudo mkdir -p /var/lock/condor # If it does not exist

# Recreate the Condor directories from scratch.
# Warning: this deletes everything under /var/lib/condor, including the spool.
sudo rm -rf /var/lib/condor
sudo mkdir -p /var/lib/condor/spool/local_univ_execute
sudo mkdir -p /var/lib/condor/execute
sudo chown -R condor: /var/lib/condor
sudo chmod 1777 /var/lib/condor/spool/local_univ_execute
sudo chmod 1777 /var/lib/condor/execute

sudo mkdir -p /var/log/condor/
sudo chown -R condor: /var/log/condor
sudo chmod 1777 /var/log/condor

# Stop and kill all the condor daemons you have running.
sudo service condor stop
sudo killall condor
sudo killall condor_procd

sudo service condor start # Condor should run as a system service.

$ ps auxwwww | grep condor # You should see all processes run under condor.
condor      7656  0.0  0.2  47508  4644 ?        Ss   08:43   0:00 /usr/sbin/condor_master -pidfile /var/run/condor/condor.pid
root        7699  0.2  0.1  24384  3920 ?        S    08:43   0:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 126
condor      7700  0.0  0.2  47004  5436 ?        Ss   08:43   0:00 condor_shared_port -f
condor      7701  0.1  0.3  57252  6620 ?        Ss   08:43   0:00 condor_collector -f
condor      7704  0.1  0.3  48352  6816 ?        Ss   08:43   0:00 condor_startd -f
condor      7705  0.0  0.3  58052  7188 ?        Ss   08:43   0:00 condor_schedd -f
condor      7706  0.0  0.2  47500  5880 ?        Ss   08:43   0:00 condor_negotiator -f

$ condor_q # Check that condor_q now works.
-- Schedd: condor@ebloc : <127.0.0.1:9618?... @ 10/26/18 08:46:06
OWNER BATCH_NAME      SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS

0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
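
To confirm that the ownership changes above took effect, a small check like the following may help. It is only a sketch: `owned_by` is a hypothetical helper, it assumes GNU `stat` (Linux), and it assumes the daemons run as the `condor` user, as in the `ps` output shown.

```shell
# Hypothetical helper: return 0 if directory $2 exists
# and is owned by user $1 (uses GNU stat).
owned_by() {
    [ -d "$2" ] && [ "$(stat -c '%U' "$2")" = "$1" ]
}

for dir in /var/lib/condor /var/log/condor; do
    if owned_by condor "$dir"; then
        echo "ok: $dir is owned by condor"
    else
        echo "check ownership: $dir"
    fi
done
```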
answered 2018-10-28T07:51:52.010