I went down a long, painful road before finding a solution that works here.
I'm using the native Jupyter server in VS Code. There, I created a .env file:
SPARK_HOME=/home/adam/projects/graph-algorithms-book/spark-3.2.0-bin-hadoop3.2
JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
PYSPARK_SUBMIT_ARGS="--driver-memory 2g --executor-memory 6g --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12 pyspark-shell"
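VS Code's Python extension normally injects this .env file into the notebook kernel on its own (via the python.envFile setting, which defaults to ${workspaceFolder}/.env). If the variables don't seem to be picked up, one fallback is to load them manually with python-dotenv before the session is created; this is just a sketch and assumes the python-dotenv package is installed and the notebook runs from the project root:

import os
from dotenv import load_dotenv  # assumption: pip install python-dotenv

# Must run before SparkSession.builder...getOrCreate(), otherwise
# PYSPARK_SUBMIT_ARGS won't be seen by the launcher.
load_dotenv()
print(os.environ.get("SPARK_HOME"))
print(os.environ.get("PYSPARK_SUBMIT_ARGS"))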
Then in my Python notebook I have something like this:
from pyspark.sql.types import *
from graphframes import *
from pyspark.sql.session import SparkSession
spark = SparkSession.builder.appName('GraphFrames').getOrCreate()
You should see the dependency resolution get printed out and the packages fetched accordingly. Something like this:
:: loading settings :: url = jar:file:/home/adam/projects/graph-algorithms-book/spark-3.2.0-bin-hadoop3.2/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /home/adam/.ivy2/cache
The jars for the packages stored in: /home/adam/.ivy2/jars
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-96a3a1f1-4ea4-4433-856b-042d0269ec1a;1.0
confs: [default]
found graphframes#graphframes;0.8.2-spark3.2-s_2.12 in spark-packages
found org.slf4j#slf4j-api;1.7.16 in central
:: resolution report :: resolve 174ms :: artifacts dl 8ms
:: modules in use:
graphframes#graphframes;0.8.2-spark3.2-s_2.12 from spark-packages in [default]
org.slf4j#slf4j-api;1.7.16 from central in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 2 | 0 | 0 | 0 || 2 | 0 |
---------------------------------------------------------------------
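As an optional sanity check (not part of the original setup, just a hedged sketch), you can confirm from the notebook that the session was built against the right Spark version and that the --packages argument made it through; the --packages flag normally shows up under the spark.jars.packages config key:

print(spark.version)  # expect 3.2.0, matching the SPARK_HOME distribution above
print(spark.sparkContext.getConf().get("spark.jars.packages", "not set"))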
After that, I was able to create some data with relationships:
v = spark.createDataFrame([
("a", "Alice", 34),
("b", "Bob", 36),
("c", "Charlie", 30),
], ["id", "name", "age"])
It should work without issues. Just remember to keep all the pyspark versions aligned. I had to install the correct build of graphframes from a forked repo: the PyPI release lags behind, so I used the PHPirates repo for a proper install. There, graphframes is already built against pyspark 3.2.0:
pip install "git+https://github.com/PHPirates/graphframes.git@add-setup.py#egg=graphframes&subdirectory=python"
pip install pyspark==3.2.0
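After both installs, a quick sanity check (my own addition, not from the original steps) is to confirm that the versions line up and that the forked package imports cleanly:

import pyspark
import graphframes  # fails here if the forked install did not work
print(pyspark.__version__)  # expect 3.2.0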