I want to connect to Cassandra from PySpark in Google Colab. I wrote the code below, which downloads Spark and sets the Java and Spark locations as environment variables. Here is the code:
!wget https://downloads.apache.org/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
!tar -xvzf spark-3.1.2-bin-hadoop3.2.tgz
!pip install findspark
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop3.2"
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars com.datastax.spark:spark-cassandra-connector_2.12:3.1.0.jar pyspark-shell'
os.environ['SPARK_SUBMIT'] = '--packages com.datastax.spark:spark-cassandra-connector2.12:3.1.0 pyspark-shell'
os.environ['SPARK_HOME']="/content/spark-3.1.2-bin-hadoop3.2"
conf = SparkConf()
conf.setAppName("Spark Cassandra")
conf.set("spark.cassandra.connection.host","host").set("spark.cassandra.auth.username","username").set("spark.cassandra.auth.password","password")
sc = SparkContext(conf=conf)
sql = SQLContext(sc)
dataFrame = sql.read.format("org.apache.spark.sql.cassandra").options(table="table", keyspace="database").load()
dataFrame.printSchema()
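When I run it, the context is created, but reading the table fails with an error that names "org.apache.spark.sql.cassandra". I think I either have to download the connector separately and add it to my path, or the way I have already included it is wrong. If anyone has a solution, please help. This is on Google Colab.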
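
For reference, here is a minimal sketch of how the connector is usually pulled in. This is an untested guess at the cause, not a confirmed fix. It assumes the error means the connector classes never reached the classpath: `--jars` expects file paths rather than Maven coordinates, the coordinate in `SPARK_SUBMIT` is missing the underscore before `2.12`, and as far as I know PySpark does not read a `SPARK_SUBMIT` variable at all. The helper names `submit_args` and `read_table` are hypothetical, and `JAVA_HOME` and `SPARK_HOME` are assumed to be set as in the code above.

```python
import os

# Maven coordinates in group:artifact_scalaVersion:version form.
# Note the underscore before 2.12 and the absence of a ".jar" suffix.
# The version 3.1.0 is simply the one used in the question.
CONNECTOR = "com.datastax.spark:spark-cassandra-connector_2.12:3.1.0"


def submit_args(package: str = CONNECTOR) -> str:
    """Build the PYSPARK_SUBMIT_ARGS value; it has to end with 'pyspark-shell'."""
    return f"--packages {package} pyspark-shell"


def read_table(host, username, password, keyspace, table):
    """Hypothetical helper: start Spark with the connector and load one table."""
    # This must be set before the JVM starts, i.e. before the first
    # SparkSession or SparkContext is created in the notebook.
    os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args()

    # Imported lazily so the module loads even where Spark is not installed.
    import findspark
    findspark.init()  # uses SPARK_HOME to make pyspark importable

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("Spark Cassandra")
        .config("spark.cassandra.connection.host", host)
        .config("spark.cassandra.auth.username", username)
        .config("spark.cassandra.auth.password", password)
        .getOrCreate()
    )
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(table=table, keyspace=keyspace)
        .load()
    )
```

Calling `read_table("host", "username", "password", "database", "table").printSchema()` would then mirror the last two lines of the original code. If a Spark context already exists in the Colab session, the runtime probably needs a restart first, because the submit arguments are only read when the JVM is launched.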