1

我正在尝试用 Python 并行化在许多文档上更正文本的任务,所以我自然而然地找到了“joblib”。我希望每项任务都是更正给定的文档。这是代码的结构:

if __name__ == '__main__':
    lexicon = build_compact_lexicon()

    from joblib import Parallel, delayed
    import multiprocessing

    num_cores = multiprocessing.cpu_count()
    results = Parallel(n_jobs=num_cores)(delayed(find_errors)('GDL', i, 1, lexicon) for i in range(1798, 1820))

我正在使用这里总结的函数 find_errors :

def find_errors(newspaper, year, month, lexicon):
    # parse the input newspaper text data using etree parser from LXML
    # detect errors in the text
    return found_errors_type1, found_errors_type2, found_errors_type3

这确实给我带来了一些错误

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 130, in __call__
    return self.func(*args, **kwargs)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 72, in __call__
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 72, in <listcomp>
    return [func(*args, **kwargs) for func, args, kwargs in self.items]
  File "hellowordParallel.py", line 85, in find_errors
    tree = etree.parse(xml_file_path)
  File "src/lxml/lxml.etree.pyx", line 3427, in lxml.etree.parse (src/lxml/lxml.etree.c:79801)
  File "src/lxml/parser.pxi", line 1805, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:116293)
TypeError: cannot parse from 'NoneType'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 392, in find_cookie
    line_string = line.decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb0 in position 24: invalid start byte

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 139, in __call__
    tb_offset=1)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 373, in format_exc
    frames = format_records(records)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 274, in format_records
    for token in generate_tokens(linereader):
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 514, in _tokenize
    line = readline()
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/format_stack.py", line 265, in linereader
    line = getline(file, lnum[0])
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 16, in getline
    lines = getlines(filename, module_globals)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 47, in getlines
    return updatecache(filename, module_globals)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/linecache.py", line 136, in updatecache
    with tokenize.open(fullname) as fp:
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 456, in open
    encoding, lines = detect_encoding(buffer.readline)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 433, in detect_encoding
    encoding = find_cookie(first)
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/tokenize.py", line 397, in find_cookie
    raise SyntaxError(msg)
  File "<string>", line None
SyntaxError: invalid or missing encoding declaration for '/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/lxml/etree.so'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "hellowordParallel.py", line 160, in <module>
    results = Parallel(n_jobs=num_cores)(delayed(find_errors)('GDL', i, 1, lexicon) for i in range(1798, 1820))
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 810, in __call__
    self.retrieve()
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/joblib/parallel.py", line 727, in retrieve
    self._output.extend(job.get())
  File "/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/multiprocessing/pool.py", line 608, in get
    raise self._value
SyntaxError: invalid or missing encoding declaration for '/home/mbl/anaconda3/envs/OCR_Correction/lib/python3.5/site-packages/lxml/etree.so'

我不明白这是否应该做与配置相关的事情,或者我的函数是否不适合并行实现......(我想它应该......)

你们中的一些人以前发生过吗?

希望我的问题很清楚,并且有足够的信息可以让某人给我一些帮助!

4

0 回答 0