I am using the Python wrapper for Weka, which is based on python-javabridge. I have a long-running task to execute, so I am running it with Celery. The problem is that I get
A fatal error has been detected by the Java Runtime Environment:
SIGSEGV (0xb) at pc=0x00007fff91a3c16f, pid=11698, tid=3587
JRE version: (8.0_31-b13) (build )
Java VM: Java HotSpot(TM) 64-Bit Server VM (25.31-b07 mixed mode bsd-amd64 compressed oops)
Problematic frame:
C [libdispatch.dylib+0x616f] _dispatch_async_f_slow+0x18b
Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
If you would like to submit a bug report, please visit:
http://bugreport.java.com/bugreport/crash.jsp
The crash happened outside the Java Virtual Machine in native code.
See problematic frame for where to report the bug.
when starting the JVM inside a thread. These two lines (from weka.core.jvm) are used to do so:
javabridge.start_vm(run_headless=True)
javabridge.attach()
From what I have read, this may be caused by the JVM not being attached to the Celery thread. However, javabridge.attach() is indeed called inside it.
What am I missing?
EDIT: I have identified the code causing the trouble. It is related to the NLTK tokenizer. The following code (adapted from Vebjorn's answer) reproduces the error:
# hello.py
from nltk.tokenize import RegexpTokenizer

import javabridge
from celery import Celery

app = Celery('hello', broker='amqp://guest@localhost//', backend='amqp')

started = False

@app.task
def hello():
    global started
    if not started:
        print 'Starting the VM'
        javabridge.start_vm(run_headless=True)
        started = True
    sentence = "This is a sentence with some numbers like 1, 2 or and some weird symbols like @, $ or ! :)"
    tokenizer = RegexpTokenizer(r'\w+')
    tokenized_sentence = tokenizer.tokenize(sentence.lower())
    print "Tokens:", tokenized_sentence
    return javabridge.run_script('java.lang.String.format("Hello, %s!", greetee);',
                                 dict(greetee='world'))
Without starting the JVM, the code runs fine. It also works when it is not run as a Celery task. I do not understand why it crashes.
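To rule out the tokenization step itself, the same pattern can be checked without NLTK, Celery, or the JVM. This is a minimal sketch using only the standard library's re module; for this input, re.findall(r'\w+', ...) produces the same token list as RegexpTokenizer(r'\w+').tokenize(...):

```python
import re

# Same sentence and pattern as in hello.py above.
sentence = ("This is a sentence with some numbers like 1, 2 or and "
            "some weird symbols like @, $ or ! :)")
tokens = re.findall(r'\w+', sentence.lower())
print("Tokens:", tokens)
```

If this runs cleanly, the tokenization is not the culprit; the crash must come from the interaction between the JVM and the worker environment.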
EDIT 2: It actually works in a clean (Dockerized) Ubuntu environment, but not on Mac OS X Yosemite (v10.10).
EDIT 3: As mentioned in the comments, it works if from nltk.tokenize import RegexpTokenizer is done inside the task wrapper, i.e. inside the hello() function.
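The difference presumably comes down to when the import executes: at module import time (in the parent process, before the worker that starts the JVM exists) versus inside the task body on first call. A stdlib-only illustration of that deferral, with colorsys merely standing in for nltk.tokenize (the names here are illustrative, not from the original code):

```python
import sys

# Start from a clean slate so the demo is observable.
sys.modules.pop('colorsys', None)

def hello():
    # Deferred import: executes only when the task body first runs,
    # not when the module defining the task is imported.
    import colorsys
    return colorsys.rgb_to_hsv(1.0, 0.0, 0.0)

assert 'colorsys' not in sys.modules   # not loaded at definition time
result = hello()
assert 'colorsys' in sys.modules       # loaded on first call
```

Moving the NLTK import into hello() changes *where* (which process/thread state) the module initialization happens, which is consistent with the crash being sensitive to the environment the JVM is started in.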