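I am a physics student trying to run research-related simulations that have a stochastic element. The simulation can be split into several non-interacting parts, each of which evolves randomly, so no communication between runs is needed.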

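Later I use a different code/file to analyze the results returned by all the jobs (this is unrelated to the problem and is mentioned only to give a clear picture of what is going on).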

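I use the institute's HPC (which I will call the "cluster") to run multiple copies of my code, which is a single .py file that does not read anything from any other file (but does create output files). Each copy of the code should create its own unique working directory with os.makedirs(path, exist_ok=True) and then move into it with os.chdir(path). I have tried running it many times, and the runs end in the following types of behavior: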

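  1. Some of the array jobs run well and behave as expected.
  2. Others overwrite each other (i.e. job1 and job2 both write to the .txt file in job1's directory).
  3. Others simply never create a directory at all, yet do not die, so I assume they keep running and may be writing my data somewhere I do not know about and cannot access.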

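These behaviors look completely random to me: I cannot tell in advance which array jobs will run perfectly and which will show behavior 2, behavior 3, or both (in a large job array, some jobs run well, some show both behavior 2 and behavior 3, and some show only 2 or only 3).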

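I have tried almost everything I could find online. For example, I read somewhere that a common problem with os.makedirs is a umask issue, and that adding os.umask(0) before the call is good practice, so I added it. I also read that a cluster may sometimes hang, so calling time.sleep for a few seconds and then trying again might work, so I did that too. Neither has solved the problem...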

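I am attaching the part of the code that is probably the culprit. N, L, T and DT are numbers I set earlier in the code, where I also import the libraries, etc. (Note that my office computer runs Windows while the cluster runs Linux, so I use os.name only to choose the directory according to the operating system I am running on, so that the code runs on both systems without modification.):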

import datetime
import os
import time

# N, L, T and DT are set earlier in the script.
when = datetime.datetime.now()
date = when.date()
hour = when.time()  # assumed definition: `hour` is used below but not defined in this excerpt
worker_num = os.environ['LSB_JOBINDEX']  # index of this job within the LSF job array
pid = os.environ['LSB_JOBID']            # LSF job ID
work = 'worker' + worker_num
txt_file = 'N{}_L{}_T{}_DT{}'.format(N, L, T, DT)
if os.name == 'nt':
    path = 'D:/My files/Python Scripts/Cluster/{}/{}/{}'.format(date, txt_file, work)
else:
    path = '/home/labs/{}/{}/{}'.format(date, txt_file, work)
    os.umask(0)
try:
    os.makedirs(path, exist_ok=True)
    os.chdir(path)
except OSError:
    # Wait, record the failure, then retry once.
    time.sleep(10)
    with open('/home/labs/error_{}_{}.txt'.format(txt_file, work), 'a+') as f:
        f.write('In {}, at time {}, job ID: {}, which was sent to queue: {}, working on host: {}, failed to create path: {} '.format(date, hour, pid, os.environ['LSB_QUEUE'], os.environ['LSB_HOSTS'], path))
    os.makedirs(path, exist_ok=True)
    os.chdir(path)

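The cluster uses an LSF environment. To run multiple instances of my code I use the "arrayjob" command, i.e. LSF sends multiple instances of the same code (100 in this case) to multiple different CPUs on different (or the same) hosts in the cluster.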

I am also attaching examples that show the errors I described above. An example of behavior 2 is the following output file:

Stst progress = 10.0% after 37 seconds

Stst progress = 10.0% after 42 seconds

Stst progress = 20.0% after 64 seconds

Stst progress = 20.0% after 75 seconds

Stst progress = 30.0% after 109 seconds

Stst progress = 40.0% after 139 seconds

worker99 is 5.00% finished after 0.586 hours and will finish in approx 11.137 hours
worker99 is 5.00% finished after 0.691 hours and will finish in approx 13.130 hours
worker99 is 10.00% finished after 1.154 hours and will finish in approx 10.382 hours
worker99 is 10.00% finished after 1.340 hours and will finish in approx 12.062 hours
worker99 is 15.00% finished after 1.721 hours and will finish in approx 9.753 hours
worker99 is 15.00% finished after 1.990 hours and will finish in approx 11.275 hours
worker99 is 20.00% finished after 2.287 hours and will finish in approx 9.148 hours
worker99 is 20.00% finished after 2.633 hours and will finish in approx 10.532 hours
worker99 is 25.00% finished after 2.878 hours and will finish in approx 8.633 hours
worker99 is 25.00% finished after 3.275 hours and will finish in approx 9.826 hours
worker99 is 30.00% finished after 3.443 hours and will finish in approx 8.033 hours
worker99 is 30.00% finished after 3.921 hours and will finish in approx 9.149 hours
worker99 is 35.00% finished after 4.015 hours and will finish in approx 7.456 hours
worker99 is 35.00% finished after 4.566 hours and will finish in approx 8.480 hours
worker99 is 40.00% finished after 4.616 hours and will finish in approx 6.924 hours
worker99 is 45.00% finished after 5.182 hours and will finish in approx 6.334 hours
worker99 is 40.00% finished after 5.209 hours and will finish in approx 7.814 hours
worker99 is 50.00% finished after 5.750 hours and will finish in approx 5.750 hours
worker99 is 45.00% finished after 5.981 hours and will finish in approx 7.310 hours
worker99 is 55.00% finished after 6.322 hours and will finish in approx 5.173 hours
worker99 is 50.00% finished after 6.623 hours and will finish in approx 6.623 hours
worker99 is 60.00% finished after 6.927 hours and will finish in approx 4.618 hours
worker99 is 55.00% finished after 7.266 hours and will finish in approx 5.945 hours
worker99 is 65.00% finished after 7.513 hours and will finish in approx 4.046 hours
worker99 is 60.00% finished after 7.928 hours and will finish in approx 5.285 hours
worker99 is 70.00% finished after 8.079 hours and will finish in approx 3.463 hours
worker99 is 65.00% finished after 8.580 hours and will finish in approx 4.620 hours
worker99 is 75.00% finished after 8.644 hours and will finish in approx 2.881 hours
worker99 is 80.00% finished after 9.212 hours and will finish in approx 2.303 hours
worker99 is 70.00% finished after 9.227 hours and will finish in approx 3.954 hours
worker99 is 85.00% finished after 9.778 hours and will finish in approx 1.726 hours
worker99 is 75.00% finished after 9.882 hours and will finish in approx 3.294 hours
worker99 is 90.00% finished after 10.344 hours and will finish in approx 1.149 hours
worker99 is 80.00% finished after 10.532 hours and will finish in approx 2.633 hours

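A .txt file like this, which tracks the progress of the code, is normally created by each job separately and stored in its own directory. In this case, for some reason, two different jobs are writing to the same file. This can be verified by looking at a different .txt file, which each job creates right after creating the directory and setting it as the working directory: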

In 2016-04-01, at time 02:11:51.851948, job ID: 373244, which was sent to
queue: new-short, working on host: cn129.wexac.weizmann.ac.il, has created 
path: /home/labs/2016-04-02/N800_L1600_T10_DT0.5/worker99 

In 2016-04-01, at time 02:12:09.968549, job ID: 373245, which was sent to 
queue: new-medium, working on host: cn293.wexac.weizmann.ac.il, has created 
path: /home/labs/2016-04-02/N800_L1600_T10_DT0.5/worker99   
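
The same check can be automated over all of these records. The following is only a minimal sketch, assuming every record follows the format of the two samples above; the function name find_shared_paths and the regular expression are illustrative and not part of the original script.

```python
import re
from collections import defaultdict

# Pattern assumed from the two sample records above: job ID, then queue, then path.
# re.DOTALL lets a single record span the wrapped lines.
RECORD = re.compile(
    r"job ID:\s*(\d+),.*?queue:\s*([\w-]+),.*?path:\s*(\S+)",
    re.DOTALL,
)


def find_shared_paths(log_text):
    """Return {path: [(job_id, queue), ...]} for every path claimed by more than one job."""
    claims = defaultdict(list)
    for job_id, queue, path in RECORD.findall(log_text):
        claims[path].append((job_id, queue))
    # Keep only the paths that more than one record reports as created.
    return {path: jobs for path, jobs in claims.items() if len(jobs) > 1}
```

Feeding it the concatenated contents of the per-job files would list each directory that more than one job ID reports as its own, together with the queue each of those jobs was sent to.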

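I would greatly appreciate any help with this problem, as it is holding back our research. If any further details are needed to solve it, I will gladly provide them.
Thanks!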

1 Answer

Looking at the log records you provided, the two jobs (373244 and 373245) were sent to two different queues:

In 2016-04-01, at time 02:11:51.851948, job ID: 373244, which was sent to queue: new-short, ...

In 2016-04-01, at time 02:12:09.968549, job ID: 373245, which was sent to queue: new-medium, ...

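This suggests that the array job was submitted twice, to two separate queues. You may want to check the code that submits the array job to make sure it runs only once and sends the jobs to a single queue.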

Submitting the array job more than once would, I think, cause the behavior you are seeing.
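
As an extra safeguard, the working directory could be keyed on the job ID as well as the array index, so that two submissions of the same array can never share a directory. This is only a sketch under that assumption, not a drop-in replacement for your script: the helper name make_work_dir and the job<ID>/worker<index> layout are hypothetical, and the values would come from the LSB_JOBID and LSB_JOBINDEX environment variables you already read.

```python
import os


def make_work_dir(base, job_id, job_index):
    """Create and return a directory unique to one (job ID, array index) pair."""
    # Hypothetical layout: <base>/job<ID>/worker<index>.
    path = os.path.join(base, "job{}".format(job_id), "worker{}".format(job_index))
    # exist_ok is left at its default (False), so this raises FileExistsError
    # if the leaf directory already exists, instead of silently sharing it.
    os.makedirs(path)
    return path
```

For example, calling make_work_dir(base, os.environ['LSB_JOBID'], os.environ['LSB_JOBINDEX']) followed by os.chdir on the returned path would give jobs 373244 and 373245 separate worker99 directories, and a genuine collision would stop the job with an error rather than interleave output in one file.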

Answered 2016-06-01T14:09:51.660