We are currently testing structured-streaming Kafka drivers. Submitting on YARN (2.7.3) with --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0' works without problems. However, when we launch on Spark standalone with deploy mode = cluster, we get
ClassNotFoundException: Failed to find data source: kafka
even though the launch command adds the Kafka jars to -Dspark.jars (see below) and the subsequent log states that these jars were successfully added.
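For reference, the failing standalone cluster-mode submission looks roughly like this (the master URL and application-jar path are placeholders, not our exact values; the driver class is the one from the stack trace below):

```shell
# Sketch of the failing standalone cluster-mode submission.
# master URL and application jar path are placeholders.
spark-submit \
  --master spark://master-host:7077 \
  --deploy-mode cluster \
  --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0' \
  --class com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount \
  /path/to/driver.jar
```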
All 10 jars exist in /home/spark/.ivy2 on all nodes. I manually verified that the KafkaSourceProvider class does exist in org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar. I further confirmed there are no issues with the jars themselves by launching the driver on YARN without the --packages option and manually adding all 10 jars with the --jars option.
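That working YARN run looked roughly like this (sketch only; the --jars list is truncated here to two of the 10 ivy jars listed below, and the application-jar path is a placeholder):

```shell
# Sketch of the working YARN submission with dependencies passed
# explicitly via --jars instead of --packages.
# The --jars list is truncated; the full run included all 10 ivy jars.
spark-submit \
  --master yarn \
  --jars /home/spark/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar,/home/spark/.ivy2/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar \
  --class com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount \
  /path/to/driver.jar
```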
The nodes run Scala 2.11.8.
Any insights appreciated.
The jars automatically added by spark-submit:
-Dspark.jars=file:/home/spark/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar,file:/home/spark/.ivy2/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar,file:/home/spark/.ivy2/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar,file:/home/spark/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar,file:/home/spark/.ivy2/jars/net.jpountz.lz4_lz4-1.3.0.jar,file:/home/spark/.ivy2/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar,file:/home/spark/.ivy2/jars/org.slf4j_slf4j-api-1.7.16.jar,file:/home/spark/.ivy2/jars/org.scalatest_scalatest_2.11-2.2.6.jar,file:/home/spark/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar,file:/home/spark/.ivy2/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar
The Spark INFO messages that appear to confirm these jars were added:
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar at spark://10.102.22.23:50513/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar with timestamp 1485467844922
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar at spark://10.102.22.23:50513/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar with timestamp 1485467844923
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar at spark://10.102.22.23:50513/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar with timestamp 1485467844923
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar at spark://10.102.22.23:50513/jars/org.spark-project.spark_unused-1.0.0.jar with timestamp 1485467844923
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/net.jpountz.lz4_lz4-1.3.0.jar at spark://10.102.22.23:50513/jars/net.jpountz.lz4_lz4-1.3.0.jar with timestamp 1485467844923
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar at spark://10.102.22.23:50513/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar with timestamp 1485467844923
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.slf4j_slf4j-api-1.7.16.jar at spark://10.102.22.23:50513/jars/org.slf4j_slf4j-api-1.7.16.jar with timestamp 1485467844923
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scalatest_scalatest_2.11-2.2.6.jar at spark://10.102.22.23:50513/jars/org.scalatest_scalatest_2.11-2.2.6.jar with timestamp 1485467844923
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar at spark://10.102.22.23:50513/jars/org.scala-lang_scala-reflect-2.11.8.jar with timestamp 1485467844924
17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar at spark://10.102.22.23:50513/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar with timestamp 1485467844924
The error message:
Caused by: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:197)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
    at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
    at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
    at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
    at com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount$.main(StructuredStreamingSignalCount.scala:76)
    at com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount.main(StructuredStreamingSignalCount.scala)
    ... 6 more
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource