I am trying to run KNIME 2.11.3 with qsub on a Scientific Linux cluster (running Sun Grid Engine), requesting 4 GB of memory.

Java in use:

java version "1.8.0_73"
Java(TM) SE Runtime Environment (build 1.8.0_73-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.73-b02, mixed mode)

Problem: KNIME starts the workflow normally, but then crashes, possibly while loading the Weka machine-learning nodes. The error message I get is:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGSEGV (0xb) at pc=0x00002b2774bf2c4c, pid=115080, tid=47451179185920
    #
    # JRE version: Java(TM) SE Runtime Environment (7.0_60-b19) (build 1.7.0_60-b19)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (24.60-b09 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libc.so.6+0x7fc4c]  cfree+0x1c

What is going on? (From the log:)

#  SIGSEGV (0xb) at pc=0x00002b2774bf2c4c, pid=115080, tid=47451179185920
#
# JRE version: Java(TM) SE Runtime Environment (7.0_60-b19) (build 1.7.0_60-b19)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.60-b09 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [libc.so.6+0x7fc4c]  cfree+0x1c
#
# Core dump written. Default location: /exports/eddie3_homes_local/pgrabows/core or core.115080
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
#

---------------  T H R E A D  ---------------

Current thread (0x00002b2820007800):  JavaThread "KNIME-TableIO-1" daemon [_thread_in_vm, id=115118, stack(0x00002b28169df000,0x00002b2816ae0000)]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=1 (SEGV_MAPERR), si_addr=0xfffffffffffffff7

Registers:
RAX=0x0000000000000000, RBX=0x00002b277ff37010, RCX=0x00002b2816adf700, RDX=0x0000000000000001
RSP=0x00002b2816addd28, RBP=0x00002b2774970130, RSI=0x0000000000000001, RDI=0xffffffffffffffff
R8 =0x0000000000000020, R9 =0x0101010101010101, R10=0x0000000000000022, R11=0x00002b2774bfbf1e
R12=0x00002b2816addd50, R13=0x00002b2775d2e860, R14=0x00002b27d40b00e0, R15=0x00002b2816adddd0
RIP=0x00002b2774bf2c4c, EFLAGS=0x0000000000010286, CSGSFS=0x0000000000000033, ERR=0x0000000000000005
  TRAPNO=0x000000000000000e

Top of Stack: (sp=0x00002b2816addd28)
0x00002b2816addd28:   00002b2774970655 00002b277453a000
0x00002b2816addd38:   0000000000000000 00002b27d40b00e0
0x00002b2816addd48:   00002b2774970198 00002b2778002470
0x00002b2816addd58:   00002b27d40b00e0 00002b277574e26d
0x00002b2816addd68:   00002b2774cea5c0 00002b2816adddd0
0x00002b2816addd78:   00002b2778002470 00002b2816addda0
0x00002b2816addd88:   00002b277574e26d 0000000000000006
0x00002b2816addd98:   0000000000000078 00002b2816adde60
0x00002b2816addda8:   00002b27757189f8 00002b277c005310
0x00002b2816adddb8:   00002b27d41387f8 00002b2816addf9f
0x00002b2816adddc8:   00002b27d41387f8 00002b2775d2fa50
0x00002b2816adddd8:   0000005000000000 000000000000002e
0x00002b2816addde8:   0000000000000000 0000000000000000
0x00002b2816adddf8:   00002b27d40affe0 000000000000002e
0x00002b2816adde08:   0000000000000100 0000000000000000
0x00002b2816adde18:   0000000000000000 0000000000000000
0x00002b2816adde28:   00002b27d40afeb0 000000000000002e
0x00002b2816adde38:   00002b27d41387f8 0000000000000000
0x00002b2816adde48:   00002b2820007800 00002b2816addf9f
0x00002b2816adde58:   0000000000000002 00002b2816addec0
0x00002b2816adde68:   00002b277571912e 00002b2820007800
0x00002b2816adde78:   00002b277c01d0bf 00002b27d40affb0
0x00002b2816adde88:   0000000000000000 00002b2816ade098
0x00002b2816adde98:   00002b28200156f0 00002b27d41387f8
0x00002b2816addea8:   00002b2820007800 00002b27d40afea0
0x00002b2816addeb8:   00002b27d41387f8 00002b2816addf20
0x00002b2816addec8:   00002b277571968e 00002b2816addf9f
0x00002b2816added8:   00002b27d40afeb0 00002b27d40b0288
0x00002b2816addee8:   00000000000003d8 00002b2816ade098
0x00002b2816addef8:   00002b2816addf9f 00002b27d41387f8
0x00002b2816addf08:   00002b2820007800 00000000b1e99c80
0x00002b2816addf18:   00002b2820007800 00002b2816addf80 

Instructions: (pc=0x00002b2774bf2c4c)
0x00002b2774bf2c2c:   1f 44 00 00 48 8b 05 b1 a2 33 00 48 8b 00 48 85
0x00002b2774bf2c3c:   c0 0f 85 bf 00 00 00 48 85 ff 0f 84 b4 00 00 00
0x00002b2774bf2c4c:   48 8b 47 f8 48 8d 4f f0 a8 02 75 28 a8 04 48 8d
0x00002b2774bf2c5c:   3d ff aa 33 00 74 0c 48 89 c8 48 25 00 00 00 fc 

Register to memory mapping:

RAX=0x0000000000000000 is an unknown value
RBX=0x00002b277ff37010 is an unknown value
RCX=0x00002b2816adf700 is pointing into the stack for thread: 0x00002b2820007800
RDX=0x0000000000000001 is an unknown value
RSP=0x00002b2816addd28 is pointing into the stack for thread: 0x00002b2820007800
RBP=0x00002b2774970130: <offset 0x1130> in /lib64/libdl.so.2 at 0x00002b277496f000
RSI=0x0000000000000001 is an unknown value
RDI=0xffffffffffffffff is an unknown value
R8 =0x0000000000000020 is an unknown value
R9 =0x0101010101010101 is an unknown value
R10=0x0000000000000022 is an unknown value
R11=0x00002b2774bfbf1e: <offset 0x88f1e> in /lib64/libc.so.6 at 0x00002b2774b73000
R12=0x00002b2816addd50 is pointing into the stack for thread: 0x00002b2820007800
R13=0x00002b2775d2e860: <offset 0xdfa860> in /exports/eddie3_homes_local/pgrabows/usr/bin/knime_2.11.3/jre/lib/amd64/server/libjvm.so at 0x00002b2774f34000
R14=0x00002b27d40b00e0 is an unknown value
R15=0x00002b2816adddd0 is pointing into the stack for thread: 0x00002b2820007800


Stack: [0x00002b28169df000,0x00002b2816ae0000],  sp=0x00002b2816addd28,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.6+0x7fc4c]  cfree+0x1c

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  sun.management.MemoryImpl.getMemoryPools0()[Ljava/lang/management/MemoryPoolMXBean;+0
j  sun.management.MemoryImpl.getMemoryPools()[Ljava/lang/management/MemoryPoolMXBean;+6
j  sun.management.ManagementFactoryHelper.getMemoryPoolMXBeans()Ljava/util/List;+0
j  java.lang.management.ManagementFactory.getMemoryPoolMXBeans()Ljava/util/List;+0
j  org.knime.core.data.util.memory.MemoryWarningSystem.findTenuredGenPool()Ljava/lang/management/MemoryPoolMXBean;+28
j  org.knime.core.data.util.memory.MemoryWarningSystem.<init>()V+26
j  org.knime.core.data.util.memory.MemoryWarningSystem.getInstance()Lorg/knime/core/data/util/memory/MemoryWarningSystem;+10
j  org.knime.core.data.util.memory.MemoryObjectTracker.<init>()V+23
j  org.knime.core.data.util.memory.MemoryObjectTracker.getInstance()Lorg/knime/core/data/util/memory/MemoryObjectTracker;+10
j  org.knime.core.data.container.Buffer.registerMemoryReleasable()V+21
j  org.knime.core.data.container.Buffer.addRow(Lorg/knime/core/data/DataRow;ZZ)V+104
j  org.knime.core.data.container.DataContainer.addRowToTableWrite(Lorg/knime/core/data/DataRow;)V+344
j  org.knime.core.data.container.DataContainer.access$4(Lorg/knime/core/data/container/DataContainer;Lorg/knime/core/data/DataRow;)V+2
j  org.knime.core.data.container.DataContainer$ASyncWriteCallable.callWithContext()Ljava/lang/Void;+101
j  org.knime.core.data.container.DataContainer$ASyncWriteCallable.call()Ljava/lang/Void;+8
j  org.knime.core.data.container.DataContainer$ASyncWriteCallable.call()Ljava/lang/Object;+1
j  java.util.concurrent.FutureTask.run()V+42
j  java.util.concurrent.ThreadPoolExecutor.runWorker(Ljava/util/concurrent/ThreadPoolExecutor$Worker;)V+95
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+5
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub

EDIT: added the full log: download log file (Dropbox)

EDIT2: added ulimit and PATH data

The ulimit settings differ between the two nodes. On the slave node:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256023
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) 1048576
file locks                      (-x) unlimited

On the master node:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256023
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 524288
cpu time               (seconds, -t) 600
max user processes              (-u) 200
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Their $LD_LIBRARY_PATH also differs: the master node has one additional entry: /exports/applications//gridengine/2011.11p1_155/lib/linux-x64
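
The `virtual memory` line in the first listing (the slave node) is easier to compare with a memory request once it is converted: `ulimit -v` reports kbytes, so 1048576 corresponds to 1 GiB. Below is a minimal sketch of that conversion. The helper name `vmem_gib` is hypothetical and not part of any tool, and whether this limit was actually in effect for the failing job is not established by the listings.

```shell
#!/bin/bash
# Convert a `ulimit -v` value (kbytes, or the word "unlimited") to whole GiB.
# vmem_gib is a hypothetical helper used only for illustration.
vmem_gib() {
  if [ "$1" = "unlimited" ]; then
    echo "unlimited"
  else
    # kbytes -> MiB -> GiB (integer division)
    echo $(( $1 / 1024 / 1024 ))
  fi
}

vmem_gib 1048576          # value from the slave node listing, prints 1
vmem_gib unlimited        # value from the master node listing, prints unlimited
vmem_gib "$(ulimit -v)"   # value in the current shell
```

Running the last line inside a submitted job, rather than in an interactive shell, would show the limit the job itself receives.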

Final edit, answer found:

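The answer was to ask the cluster for more RAM: I requested a minimum of 8 GB by passing "-l h_vmem=8G" to qsub. This is awkward, because the same workflow runs fine on my old laptop with 4 GB of RAM, yet it produced such a severe error elsewhere. It may also be an error related to our local cluster configuration.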
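
For reference, the same request can be embedded in the job script instead of being typed on the command line, since Grid Engine reads lines starting with `#$` as qsub options. The following is a sketch only: the function name `report_limits` is hypothetical, and the KNIME launch command is left out because it is not given above. Printing the limits at job start makes it possible to see in the job output what the job was actually granted.

```shell
#!/bin/bash
#$ -l h_vmem=8G
# The "#$" line above is the embedded form of `qsub -l h_vmem=8G`.
# To a plain shell it is just a comment, so this script also runs outside the cluster.

# report_limits is a hypothetical helper: it prints the limits seen by this process.
report_limits() {
  echo "virtual memory (kbytes): $(ulimit -v)"
  echo "stack size (kbytes): $(ulimit -s)"
}

report_limits

# The KNIME launch command would go here (omitted: not given in this post).
```

If the reported virtual memory value is far below the request, that points to the cluster configuration rather than to the workflow.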
