On a Dataproc Spark cluster, the graphframes package is available in spark-shell but not in a Jupyter PySpark notebook.
PySpark kernel config:
PACKAGES_ARG='--packages graphframes:graphframes:0.2.0-spark2.0-s_2.11'
This is the command used to create the cluster:
gcloud dataproc clusters create my-dataproc-cluster \
  --properties spark.jars.packages=com.databricks:graphframes:graphframes:0.2.0-spark2.0-s_2.11 \
  --metadata "JUPYTER_PORT=8124,INIT_ACTIONS_REPO=https://github.com/{xyz}/dataproc-initialization-actions.git" \
  --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh \
  --num-workers 2 \
  --properties spark:spark.executorEnv.PYTHONHASHSEED=0,spark:spark.yarn.am.memory=1024m \
  --worker-machine-type=n1-standard-4 \
  --master-machine-type=n1-standard-4
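To narrow down where the package gets lost, a check along these lines could be run in a notebook cell. It is only a rough sketch: it assumes the kernel passes PACKAGES_ARG to PySpark through the PYSPARK_SUBMIT_ARGS environment variable, which may not hold for every version of the init action, and graphframes_status is a hypothetical helper name, not part of any library.

```python
import importlib.util
import os


def graphframes_status(env=None):
    """Report whether graphframes looks reachable from this Python process.

    `env` defaults to the process environment; a plain dict can be passed
    instead to check a hypothetical kernel configuration.
    """
    env = os.environ if env is None else env
    # Assumption: the kernel forwards its --packages flag via PYSPARK_SUBMIT_ARGS.
    submit_args = env.get("PYSPARK_SUBMIT_ARGS", "")
    return {
        # True only if the Python side of graphframes is on sys.path.
        "module_importable": importlib.util.find_spec("graphframes") is not None,
        # True only if the submit arguments mention graphframes at all.
        "packages_in_submit_args": "graphframes" in submit_args,
        "submit_args": submit_args,
    }


print(graphframes_status())
```

If packages_in_submit_args comes back False, the kernel likely never received the --packages flag; if it is True but module_importable is False, the package was probably requested but its Python files are not on the notebook's path.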