Hi, I'm getting an error when I run a PySpark job on a Dataproc cluster. I have a Dockerfile in which I install the job's dependencies, e.g. the currencies package (a simplified sketch of the Dockerfile is below).
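My Dockerfile installs the dependencies roughly like this (a simplified, illustrative sketch, not the exact file; the base image and versions may differ):

# Hypothetical, simplified Dockerfile -- the real one may pin versions
# and do extra setup, but the dependency install is the relevant part
FROM python:3.8-slim
RUN pip install numpy currencies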
main.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("docker-numpy").getOrCreate()
sc = spark.sparkContext

# numpy imports fine inside the container
import numpy as np
a = np.arange(15).reshape(3, 5)
print(a)

# this is the import that fails on the cluster
from currencies import Currency
currency = Currency('USD')
result = currency.get_money_format(13)
print(result)
When I run the pyspark job I hit a dependency problem (module not found), even though I added the Docker image runtime; I expected the dependencies to be loaded from the Docker image. These are the properties I set:
PROPS="^#^spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker"
PROPS="${PROPS}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$IMAGE_ID"
PROPS="${PROPS}#spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=$MOUNTS"
PROPS="${PROPS}#spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker"
PROPS="${PROPS}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$IMAGE_ID"
PROPS="${PROPS}#spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=$MOUNTS"
PROPS="${PROPS}#spark.executor.cores=2"
PROPS="${PROPS}#spark.submit.deployMode=cluster"
I run the following command with the properties above:
gcloud dataproc --project ${PROJECT} jobs submit pyspark \
--cluster=${CLUSTER_NAME} \
--region=${REGION} \
--properties ${PROPS} \
main.py
As soon as I submit the job, I get the following error:
LogAggregationType: AGGREGATED
=========================================================================================================================================
LogType:stderr
LogLastModifiedTime:Mon Dec 27 12:59:34 +0000 2021
LogLength:1496
LogContents:
21/12/27 12:59:30 INFO org.sparkproject.jetty.util.log: Logging initialized @2940ms to org.sparkproject.jetty.util.log.Slf4jLog
21/12/27 12:59:30 INFO org.sparkproject.jetty.server.Server: jetty-9.4.40.v20210413; built: 2021-04-13T20:42:42.668Z; git: b881a572662e1943a14ae12e7e1207989f218b74; jvm 1.8.0_292-b10
21/12/27 12:59:30 INFO org.sparkproject.jetty.server.Server: Started @3042ms
21/12/27 12:59:30 INFO org.sparkproject.jetty.server.AbstractConnector: Started ServerConnector@1a966f2e{HTTP/1.1, (http/1.1)}{0.0.0.0:36491}
21/12/27 12:59:32 INFO com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl: Ignoring exception of type GoogleJsonResponseException; verified object already exists with desired state.
21/12/27 12:59:32 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
21/12/27 12:59:32 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at cluster-with-docker-master-m/10.148.0.5:8030
21/12/27 12:59:33 INFO org.apache.hadoop.conf.Configuration: resource-types.xml not found
21/12/27 12:59:33 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils: Unable to find 'resource-types.xml'.
21/12/27 12:59:34 ERROR org.apache.spark.deploy.yarn.ApplicationMaster: User application exited with status 1
21/12/27 12:59:34 INFO org.sparkproject.jetty.server.AbstractConnector: Stopped Spark@1a966f2e{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
End of LogType:stderr
LogType:stdout
LogLastModifiedTime:Mon Dec 27 12:59:34 +0000 2021
LogLength:214
LogContents:
[[ 0 1 2 3 4]
[ 5 6 7 8 9]
[10 11 12 13 14]]
Traceback (most recent call last):
File "main.py", line 9, in <module>
from currencies import Currency
ModuleNotFoundError: No module named 'currencies'
End of LogType:stdout
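To rule out the image itself, I would expect an import check like the following to succeed when the image is run locally (a hypothetical sanity check, not output from my actual run):

# hypothetical local check that the package is baked into the image
docker run --rm $IMAGE_ID python -c "from currencies import Currency; print(Currency('USD').get_money_format(13))"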
Can anyone help? Thanks in advance.