We are currently testing Structured Streaming Kafka drivers. Submitting on YARN (2.7.3) with --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0' works without problems. However, when we try to launch on Spark standalone with deploy mode = cluster, we get the

ClassNotFoundException: Failed to find data source: kafka

error, even though the launch command adds the Kafka jars to -Dspark.jars (see below) and the subsequent log states that these jars have been successfully added.

All 10 jars exist in /home/spark/.ivy2 on all nodes. I manually verified that the KafkaSourceProvider class does exist in org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar. I further confirmed that the jars themselves are fine by launching the driver on YARN without the --packages option and instead adding all 10 jars manually with the --jars option. The nodes run Scala 2.11.8.
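For reference, the two submissions look roughly like the sketch below. The driver jar name and master host are placeholders (the main class is taken from the stack trace further down); this is only an illustration of the two modes, not the exact commands used:

```shell
# Works: YARN, resolving the Kafka source via --packages
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0' \
  --class com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount \
  driver.jar

# Fails with "Failed to find data source: kafka": standalone, cluster deploy mode
spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode cluster \
  --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0' \
  --class com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount \
  driver.jar
```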

Any insights appreciated.

  1. The jars automatically added by spark-submit:

    -Dspark.jars=file:/home/spark/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar,file:/home/spark/.ivy2/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar,file:/home/spark/.ivy2/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar,file:/home/spark/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar,file:/home/spark/.ivy2/jars/net.jpountz.lz4_lz4-1.3.0.jar,file:/home/spark/.ivy2/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar,file:/home/spark/.ivy2/jars/org.slf4j_slf4j-api-1.7.16.jar,file:/home/spark/.ivy2/jars/org.scalatest_scalatest_2.11-2.2.6.jar,file:/home/spark/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar,file:/home/spark/.ivy2/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar
    
  2. Spark info messages which appear to show these jars were added:

    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar at spark://10.102.22.23:50513/jars/org.apache.spark_spark-sql-kafka-0-10_2.11-2.1.0.jar with timestamp 1485467844922
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar at spark://10.102.22.23:50513/jars/org.apache.kafka_kafka-clients-0.10.0.1.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar at spark://10.102.22.23:50513/jars/org.apache.spark_spark-tags_2.11-2.1.0.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.spark-project.spark_unused-1.0.0.jar at spark://10.102.22.23:50513/jars/org.spark-project.spark_unused-1.0.0.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/net.jpountz.lz4_lz4-1.3.0.jar at spark://10.102.22.23:50513/jars/net.jpountz.lz4_lz4-1.3.0.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar at spark://10.102.22.23:50513/jars/org.xerial.snappy_snappy-java-1.1.2.6.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.slf4j_slf4j-api-1.7.16.jar at spark://10.102.22.23:50513/jars/org.slf4j_slf4j-api-1.7.16.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scalatest_scalatest_2.11-2.2.6.jar at spark://10.102.22.23:50513/jars/org.scalatest_scalatest_2.11-2.2.6.jar with timestamp 1485467844923
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scala-lang_scala-reflect-2.11.8.jar at spark://10.102.22.23:50513/jars/org.scala-lang_scala-reflect-2.11.8.jar with timestamp 1485467844924
    17/01/26 21:57:24 INFO SparkContext: Added JAR file:/home/spark/.ivy2/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar at spark://10.102.22.23:50513/jars/org.scala-lang.modules_scala-xml_2.11-1.0.2.jar with timestamp 1485467844924
    
  3. The error message:

    Caused by: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:569)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceSchema(DataSource.scala:197)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo$lzycompute(DataSource.scala:87)
        at org.apache.spark.sql.execution.datasources.DataSource.sourceInfo(DataSource.scala:87)
        at org.apache.spark.sql.execution.streaming.StreamingRelation$.apply(StreamingRelation.scala:30)
        at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:124)
        at com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount$.main(StructuredStreamingSignalCount.scala:76)
        at com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount.main(StructuredStreamingSignalCount.scala)
        ... 6 more
    Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
    
1 Answer

This is a known issue. See https://issues.apache.org/jira/browse/SPARK-4160

For now, you can use client mode as a workaround.
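A minimal sketch of the workaround, keeping everything else the same and only changing the deploy mode (the driver jar name and master host are placeholders):

```shell
spark-submit \
  --master spark://<master-host>:7077 \
  --deploy-mode client \
  --packages 'org.apache.spark:spark-sql-kafka-0-10_2.11:2.1.0' \
  --class com.dematic.labs.analytics.diagnostics.spark.drivers.StructuredStreamingSignalCount \
  driver.jar
```

In client mode the driver runs on the submitting machine, where spark-submit has already resolved the --packages jars onto the classpath, so the kafka data source can be found.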

answered 2017-01-31T21:15:44.427