I built the pyahocorasick library with the command python setup.py bdist_egg and uploaded the resulting egg to Spark so that my PySpark job can use it.

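However, for security reasons, the .so file inside pyahocorasick cannot be imported through the pkg_resources.resource_filename() method on the Spark cluster. The import fails with the following error: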

Traceback (most recent call last):
  File "spark_datawash.py", line 251, in <module>
    import ahocorasick
  File "build/bdist.linux-x86_64/egg/ahocorasick.py", line 7, in <module>
  File "build/bdist.linux-x86_64/egg/ahocorasick.py", line 4, in __bootstrap__
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1152, in resource_filename
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1696, in get_resource_filename
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1726, in _extract_resource
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1219, in get_cache_path
  File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1199, in extraction_error
pkg_resources.ExtractionError: Can't extract file(s) to egg cache

The following error occurred while trying to extract file(s) to the Python egg
cache:

  [Errno 13] Permission denied: '/home/.python-eggs'

The Python egg cache directory is currently set to:

  /home/.python-eggs

Perhaps your account does not have write access to this directory?  You can
change the cache directory by setting the PYTHON_EGG_CACHE environment
variable to point to an accessible directory.
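The error message itself suggests one possible workaround: pointing PYTHON_EGG_CACHE at a directory the job can write to before the first import of ahocorasick. Below is a minimal sketch of that idea, not something verified on this cluster. The helper name use_writable_egg_cache is hypothetical, and it assumes the node's temp directory is writable and allows loading shared objects from it.

```python
import os
import tempfile


def use_writable_egg_cache(env=os.environ):
    """Create a fresh temp directory and point PYTHON_EGG_CACHE at it.

    Hypothetical helper: assumes the process may write to the system
    temp directory. Returns the path of the new cache directory.
    """
    cache_dir = tempfile.mkdtemp(prefix="python-eggs-")
    env["PYTHON_EGG_CACHE"] = cache_dir
    return cache_dir


# Must run before the first `import ahocorasick` in the same process,
# because the egg cache location is looked up when the .so is extracted.
cache_dir = use_writable_egg_cache()
```

On the executors the variable has to be set in the worker processes too, not only in the driver. Spark's spark.executorEnv.PYTHON_EGG_CACHE setting may be one way to do that, or the helper could be called inside the function that runs on the workers. Whether the cluster's security policy permits this is an open question.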

This is how pyahocorasick imports the .so:

def __bootstrap__():
    global __bootstrap__, __loader__, __file__
    import sys, pkg_resources, imp 
    __file__ = pkg_resources.resource_filename(__name__, 'ahocorasick.so')
    __loader__ = None; del __bootstrap__, __loader__
    imp.load_dynamic(__name__,__file__)
__bootstrap__()

Can I import the .so through resource_stream() instead of resource_filename(), or in some other way that does not read from an absolute file path? Thanks, everyone.
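To make the resource_stream() idea concrete: as far as I understand, the dynamic loader still needs the shared object to exist as a real file, so the stream would have to be copied to a writable location first. The following is an untested sketch under that assumption. The helper copy_stream_to_temp_file is a hypothetical name, and the demo uses an in-memory stream in place of the real ahocorasick.so.

```python
import io
import os
import shutil
import tempfile


def copy_stream_to_temp_file(stream, suffix=".so"):
    """Copy a binary file-like object into a new temp file and return its path.

    Hypothetical helper: assumes the temp directory is writable.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as out:
        shutil.copyfileobj(stream, out)
    return path


# Demo with placeholder bytes standing in for the real shared object.
demo_path = copy_stream_to_temp_file(io.BytesIO(b"placeholder bytes"))

# Inside the bootstrap stub, the idea would be roughly (not executed here):
#   stream = pkg_resources.resource_stream(__name__, 'ahocorasick.so')
#   path = copy_stream_to_temp_file(stream)
#   imp.load_dynamic(__name__, path)
```

This only moves the extraction to a different directory, so it would not help if the cluster forbids loading native code from temp locations.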

By the way, for other reasons I cannot install pyahocorasick on every node of the Spark cluster, so I have to upload a zipped egg distribution for the job to use.
