
A friend of mine is working on a neural network program that has to run for several days before it finishes.

Because of a power problem, the computer ran for a few minutes on the UPS (no-break). I don't know whether that affected the running process. She tells me that the process should have copied some files by now, but so far nothing has appeared.

I was wondering what else I could do to check whether the process is running properly. What I have done so far:

Just to clarify, the script tt13.sh calls the script prog.sh, which runs the program ca. It was started three times, once for each of three of the computer's cores.

$ htop -u katia

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                 
 2464 katia     20   0 2059m 2.0g  624 R  99.8 26.1  28879:00 ca                                                                      
 2469 katia     20   0 2058m 2.0g  624 R  99.8 26.1  28879:04 ca                                                                      
 2459 katia     20   0 2058m 2.0g  624 R  99.5 26.1  28879:06 ca                                                                      
 2455 katia     20   0 16540 1444 1228 S   0.0  0.0   0:00.00 tt13.sh                                                                 
 2458 katia     20   0 16536 1396 1176 S   0.0  0.0   0:00.00 bash                                                                    
 2460 katia     20   0 16540 1448 1228 S   0.0  0.0   0:00.00 tt13.sh                                                                 
 2463 katia     20   0 16536 1392 1176 S   0.0  0.0   0:00.00 bash                                                                    
 2465 katia     20   0 16540 1448 1228 S   0.0  0.0   0:00.00 tt13.sh                                                                 
 2468 katia     20   0 16536 1392 1176 S   0.0  0.0   0:00.00 bash   


$ lsof -p 2459
ca      2459 katia  cwd    DIR    8,5     4096 3670017 /tmp/program_13
ca      2459 katia  rtd    DIR    8,1     4096       2 /
ca      2459 katia  txt    REG    8,5    27897 3670034 /tmp/program_13/ca
ca      2459 katia  mem    REG    8,1  1811160  130374 /lib/x86_64-linux-gnu/libc-2.15.so
ca      2459 katia  mem    REG    8,1  1030536  130398 /lib/x86_64-linux-gnu/libm-2.15.so
ca      2459 katia  mem    REG    8,1   149312  130622 /lib/x86_64-linux-gnu/ld-2.15.so
ca      2459 katia    0w   CHR    1,3      0t0    3076 /dev/null
ca      2459 katia    1w   REG    8,5        0 5242882 /tmp/results13/251.out
ca      2459 katia    2w   REG    8,1     1059  130681 /home/katia/nohup.out
ca      2459 katia    4w   REG    8,5        0 3670036 /tmp/program_13/basi251.out (deleted)


$ ls -l /proc/2459/fd
total 0
l-wx------ 1 katia katia 64 Jul  7 21:47 0 -> /dev/null
l-wx------ 1 katia katia 64 Jul  7 21:47 1 -> /tmp/results13/251.out
l-wx------ 1 katia katia 64 Jun 17 19:00 2 -> /home/katia/nohup.out
l-wx------ 1 katia katia 64 Jul  7 21:47 4 -> /tmp/program_13/basi251.out (deleted)

What does that "(deleted)" mean? Also, what else could I do to check the health of the process?

Any other ideas are welcome.



1 Answer


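If you have the source code for the program, and the program was compiled with debugging information, you can attach to it with gdb -p pid /path/to/executable. With that you can poke around and see whether the program's internal state is what you expect. Once you are satisfied, you can detach from the process and it will resume execution from where it left off.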
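A minimal sketch of automating that attach, inspect, detach cycle, assuming gdb is installed and you are allowed to attach to the process. The helper names gdb_backtrace_command and gdb_backtrace are made up for illustration; the gdb options used (-p, -batch, -ex) are standard ones. In batch mode gdb runs the given commands and then exits, which releases the attached process so it keeps running.

```python
import subprocess


def gdb_backtrace_command(pid, executable):
    """Build a gdb command line that attaches, prints a backtrace, and exits.

    Hypothetical helper: only the gdb options themselves are real.
    """
    return [
        "gdb",
        "-batch",        # run the -ex commands, then exit (detaching on exit)
        "-ex", "bt",     # print the current call stack of the process
        "-p", str(pid),  # attach to the already running process
        executable,      # binary that gdb reads symbols from
    ]


def gdb_backtrace(pid, executable, timeout=60):
    """Run the command above and return gdb's combined output as text."""
    result = subprocess.run(
        gdb_backtrace_command(pid, executable),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
        timeout=timeout,
    )
    return result.stdout
```

For the process shown in the question this would be called as gdb_backtrace(2459, "/tmp/program_13/ca"). Taking two backtraces a little while apart and comparing them gives a rough idea of whether the program is moving through its code or sitting in the same place. Some systems restrict attaching to a running process, in which case gdb may need elevated privileges.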

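As for the "deleted" file: in UNIX and Linux it is perfectly legal, and quite common, to open a new temporary file and then immediately unlink it. Because of the way filesystem inodes work, the file continues to exist for as long as a process has it open. However, it has no directory entry through which you could reach it; it is usable only through that open file handle. When the process closes the file (or the process exits), the contents of the file go away as well.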
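The open-then-unlink pattern can be reproduced in a few lines. This is only an illustrative sketch, not what the ca program actually does; the function name unlinked_file_demo is made up, and the /proc lookup assumes a Linux system (it is skipped elsewhere).

```python
import os
import tempfile


def unlinked_file_demo():
    """Open a temp file, unlink it, and keep using it through its descriptor."""
    fd, path = tempfile.mkstemp()            # create and open a new file (read/write)
    os.unlink(path)                          # remove its only directory entry
    visible_by_name = os.path.exists(path)   # the name is gone from the directory

    os.write(fd, b"still here\n")            # writing through the descriptor still works
    os.lseek(fd, 0, os.SEEK_SET)             # rewind to the start of the file
    data = os.read(fd, 64)                   # read the data back

    # On Linux, /proc/self/fd/<n> is a symlink whose target is shown with a
    # " (deleted)" suffix, just like the lsof and ls output in the question.
    proc_link = f"/proc/self/fd/{fd}"
    target = os.readlink(proc_link) if os.path.islink(proc_link) else None

    os.close(fd)                             # now the storage is actually released
    return visible_by_name, data, target


if __name__ == "__main__":
    print(unlinked_file_demo())
```

So the "(deleted)" marker on descriptor 4 by itself is not a sign that something went wrong; it only says that the name basi251.out was removed while the process still had the file open.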

answered 2013-07-08T01:12:38.937