numpy - No module named numpy when spark-submitting

Question

I am spark-submitting a Python file that imports numpy, but I get a "No module named numpy" error.

$ spark-submit --py-files projects/other_requirements.egg projects/jobs/my_numpy_als.py
Traceback (most recent call last):
  File "/usr/local/www/my_numpy_als.py", line 13, in <module>
    from pyspark.mllib.recommendation import ALS
  File "/usr/lib/spark/python/pyspark/mllib/__init__.py", line 24, in <module>
    import numpy
ImportError: No module named numpy

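I was thinking I would pull in an egg for numpy via --py-files, but I am having trouble figuring out how to build that egg. Then it occurred to me that pyspark itself uses numpy, so it would be silly to pull in my own version of numpy.

Any ideas on the appropriate thing to do here?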

score 4 · Accepted Answer

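It looks like Spark is using a version of Python that does not have numpy installed. This could be because you are working inside a virtual environment.

Try this: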

import os
import sys

# The following is for specifying a Python version for PySpark. Here we
# use the currently calling Python version.
# This is handy for when we are using a virtualenv, for example, because
# otherwise Spark would choose the default system Python version.
os.environ['PYSPARK_PYTHON'] = sys.executable
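
To confirm which interpreter is actually missing numpy, a small check along these lines may help. This is a minimal sketch and not part of the original answer; the helper name `can_import` is hypothetical. It asks a given Python executable to import a module in a subprocess, so it can be pointed at whatever path you intend to use for `PYSPARK_PYTHON`.

```python
import subprocess
import sys


def can_import(python_exe, module_name):
    """Return True if `python_exe` can import `module_name`.

    Hypothetical helper for illustration. The import runs in a subprocess,
    so the result reflects that interpreter's own installed packages.
    """
    result = subprocess.run(
        [python_exe, "-c", "import " + module_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


if __name__ == "__main__":
    # Check the interpreter that is running this script.
    print(sys.executable, "can import numpy:", can_import(sys.executable, "numpy"))
```

If this reports False for the interpreter Spark ends up using, the ImportError in the question is the expected outcome.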

score 1 · Accepted Answer

I got this to work by installing numpy on all the EMR nodes, using a small bootstrap script that contains the following (among other things).

#!/bin/bash -xe
sudo yum install python-numpy python-scipy -y

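Then configure the bootstrap script to run when the cluster starts by adding the following option to the aws emr command (the example below passes one argument to the bootstrap script):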

--bootstrap-actions Path=s3://some-bucket/keylocation/bootstrap.sh,Name=setup_dependencies,Args=[s3://some-bucket]

This can also be used when setting up a cluster automatically from DataPipeline.

score 0 · Accepted Answer

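Sometimes, when you import certain libraries, your namespace gets polluted with numpy functions. Functions such as min, max and sum are especially prone to this pollution. When in doubt, locate the calls to these functions and replace them with __builtin__.sum etc. Doing so is sometimes faster than locating the source of the pollution.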
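
The shadowing described above can be illustrated with a minimal sketch. The `polluting_sum` function is a hypothetical stand-in for whatever a wildcard import might bring in. Note that in Python 3 the module is named `builtins`; `__builtin__` is the Python 2 name.

```python
import builtins  # Python 3 name; the Python 2 equivalent is __builtin__


def polluting_sum(values):
    """Hypothetical stand-in for a library function that shadows sum."""
    return "not the builtin sum"


# Simulate what a wildcard import can do: rebind the name `sum`
# in this module's namespace.
sum = polluting_sum

data = [1, 2, 3]

shadowed = sum(data)           # calls the stand-in, not the builtin
explicit = builtins.sum(data)  # calls the real builtin regardless of shadowing

print(shadowed)
print(explicit)
```

Calling through `builtins` sidesteps the rebinding without having to track down which import caused it.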

score 0 · Accepted Answer

Make sure that PYSPARK_PATH in your spark-env.sh points to the correct Python version. Add export PYSPARK_PATH=/your_python_exe_path to the /conf/spark-env.sh file.
