I get a NullPointerException when I try to write an Avro file from a DataFrame created from CSV files:
public static void main(String[] args) {
    SparkSession spark = SparkSession
            .builder()
            .appName("SparkCsvToAvro")
            .master("local")
            .getOrCreate();
    SQLContext context = new SQLContext(spark);
    String path = "C:\\git\\sparkCsvToAvro\\src\\main\\resources";
    DataFrameReader read = context.read();
    Dataset<Row> csv = read.csv(path);
    DataFrameWriter<Row> write = csv.write();
    DataFrameWriter<Row> format = write.format("com.databricks.spark.avro");
    format.save("C:\\git\\sparkCsvToAvro\\src\\main\\resources\\avro");
}
My pom.xml:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <junit.version>4.12</junit.version>
    <spark-core.version>2.1.0</spark-core.version>
    <maven-compiler-plugin.version>3.5.1</maven-compiler-plugin.version>
    <maven-compiler-plugin.source>1.8</maven-compiler-plugin.source>
    <maven-compiler-plugin.target>1.8</maven-compiler-plugin.target>
    <spark-avro.version>3.2.0</spark-avro.version>
    <spark-csv.version>1.5.0</spark-csv.version>
    <spark-sql.version>2.1.0</spark-sql.version>
</properties>
...
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>${maven-compiler-plugin.version}</version>
            <configuration>
                <source>${maven-compiler-plugin.source}</source>
                <target>${maven-compiler-plugin.target}</target>
            </configuration>
        </plugin>
    </plugins>
</build>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark-core.version}</version>
    </dependency>
    <dependency>
        <groupId>com.databricks</groupId>
        <artifactId>spark-avro_2.11</artifactId>
        <version>${spark-avro.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>${spark-sql.version}</version>
    </dependency>
</dependencies>
And the exception stack trace:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
...
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.NullPointerException
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1012)
at org.apache.hadoop.util.Shell.runCommand(Shell.java:404)
I don't know what I'm doing wrong. Are the dependencies incorrect, or is this just bad practice on my part?
The NPE occurs here:
    DataFrameWriter<Row> format = write.format("com.databricks.spark.avro");
    format.save("C:\\git\\sparkCsvToAvro\\src\\main\\resources\\avro");
"format" is null, and I don't know why.
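
One observation that may help narrow this down: the "Caused by" frames in the trace are in java.lang.ProcessBuilder.start, called from org.apache.hadoop.util.Shell.runCommand, not in the Avro writer itself. On Windows, an NPE at that spot is commonly reported when Hadoop cannot locate winutils.exe. The following is only a diagnostic sketch under that assumption, not a confirmed fix. The class name WinutilsCheck and its helper methods are hypothetical; it assumes Hadoop resolves its home directory from the hadoop.home.dir system property first and the HADOOP_HOME environment variable second, and expects the binary under bin\winutils.exe.

```java
import java.io.File;

// Hypothetical diagnostic helper; not part of Spark or Hadoop.
class WinutilsCheck {

    // Assumed lookup order: the system property first, then the environment variable.
    static String resolveHadoopHome(String sysProp, String envVar) {
        if (sysProp != null && !sysProp.isEmpty()) {
            return sysProp;
        }
        if (envVar != null && !envVar.isEmpty()) {
            return envVar;
        }
        return null; // nothing configured
    }

    // True only if <hadoopHome>/bin/winutils.exe exists as a regular file.
    static boolean hasWinutils(String hadoopHome) {
        if (hadoopHome == null) {
            return false;
        }
        return new File(new File(hadoopHome, "bin"), "winutils.exe").isFile();
    }

    public static void main(String[] args) {
        String home = resolveHadoopHome(
                System.getProperty("hadoop.home.dir"),
                System.getenv("HADOOP_HOME"));
        System.out.println("Hadoop home: " + home);
        System.out.println("winutils.exe found: " + hasWinutils(home));
    }
}
```

If this reports no Hadoop home or no winutils.exe, that would be consistent with the trace above; if both are present, the cause is likely elsewhere.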