我正在使用 Spark 1.6.2 在 HDP 2.4.3 上运行 Vora 1.3。
我有两张具有相同架构数据的表,一张位于 HANA 数据库中,另一张以 CSV 文件形式存储在 HDFS 中。
我使用 Zeppelin 在 Vora 中创建了两个表:
CREATE TABLE flights_2006 (Year int, Month_ int, DayofMonth int, DayOfWeek int, DepTime int, CRSDepTime int, ArrTime int, CRSArrTime int, UniqueCarrier string, FlightNum int,
TailNum string, ActualElapsedTime int, CRSElapsedTime int, AirTime int, ArrDelay int, DepDelay int, Origin string, Dest string, Distance int, TaxiIn int, TaxiOut int,
Cancelled int, CancellationCode int, Diverted int, CarrierDelay int, WeatherDelay int, NASDelay int, SecurityDelay int, LateAircraftDelay int)
USING com.sap.spark.vora
OPTIONS (
files "/exch/flights_filtered/part-00000,/exch/flights_filtered/part-00001,/exch/flights_filtered/part-00002,/exch/flights_filtered/part-00003,/exch/flights_filtered/part-00004",
csvdelimiter ","
)
Q1。顺便说一句,当从文件源创建 Vora 表时,什么时候可以只提供目录名称,而不是列出目录中的所有文件?这是非常不切实际的,因为无法预测目录中有多少部分文件。
CREATE TABLE flights_2007
USING com.sap.spark.hana
OPTIONS (
tablepath "XXXXXXXXXXXX",
dbschema "XXXXXXXXXX",
host "XXXXXXXXXXX",
instance "00",
user "XXXXXXXXXXX",
passwd "XXXXXXXXXX"
)
而且我能够从这两个的表连接中产生一个结果(这种连接的业务意义放在一边):
select f7.MONTH, f7.DAYOFMONTH, f7.UNIQUECARRIER, f7.FLIGHTNUM, f7.YEAR, f7.DEPTIME, f6.year, f6.DepTime
from flights_2007 as f7 inner join flights_2006 as f6
on f7.MONTH = f6.Month_ and f7.DAYOFMONTH = f6.DayofMonth and f7.UNIQUECARRIER = f6.UniqueCarrier and f7.FLIGHTNUM = f6.FlightNum
where f7.MONTH = 1 and f7.DAYOFMONTH = 2 and f7.UNIQUECARRIER = 'WN'
然后我尝试在 Vora Modeler 中执行相同的步骤。
Q2。Zeppelin 中的 REGISTER TABLE 为什么不会导致 Vora Modeler 中的表可用?
因此,我在 Vora Modeler 中执行了相同的两个表创建语句,在表名中使用所有大写字母,因为我记得 Vora 之前有一些问题。然后创建一个 Vora 视图作为具有以下条件的两个表的连接:
FLIGHTS_2007.MONTH = FLIGHTS_2006.MONTH_ and
FLIGHTS_2007.DAYOFMONTH = FLIGHTS_2007.DAYOFMONTH and
FLIGHTS_2007.UNIQUECARRIER = FLIGHTS_2006.UNIQUECARRIER and
FLIGHTS_2007.FLIGHTNUM = FLIGHTS_2006.FLIGHTNUM
..并使用了where条件:
FLIGHTS_2007.MONTH = 1 and
FLIGHTS_2007.DAYOFMONTH = 2 and
FLIGHTS_2007.UNIQUECARRIER = 'WN'
该视图预览的预期结果将与基于 Zeppelin 的选择相同。实际结果(前几行):
org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2165.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2165.0 (TID 78743, eba165.extendtec.com.au): com.sap.spark.vora.client.jdbc.VoraJdbcException: [Vora [eba165.extendtec.com.au:34530.1615085]] Unknown error when executing SELECT "FLIGHTS_2006"."FLIGHTNUM", "FLIGHTS_2006"."DEPTIME", "FLIGHTS_2006"."UNIQUECARRIER", "FLIGHTS_2006"."MONTH_", "FLIGHTS_2006"."YEAR" FROM "FLIGHTS_2006": HL(9): Runtime error. (schema error: could not resolve column "FLIGHTS_2006"."YEAR" (sql parse error)) at com.sap.spark.vora.client.jdbc.VoraJdbcClient.liftedTree1$1(VoraJdbcClient.scala:210) at com.sap.spark.vora.client.jdbc.VoraJdbcClient.generateAutocloseableIteratorFromQuery(VoraJdbcClient.scala:187) at com.sap.spark.vora.client.VoraClient$$anonfun$generateAutocloseableIteratorFromQuery$1.apply(VoraClient.scala:363) at com.sap.spark.vora.client.VoraClient$$anonfun$generateAutocloseableIteratorFromQuery$1.apply(VoraClient.scala:363) at scala.util.Try$.apply(Try.scala:161) at com.sap.spark.vora.client.VoraClient.handleExceptions(VoraClient.scala:775) at com.sap.spark.vora.client.VoraClient.generateAutocloseableIteratorFromQuery(VoraClient.scala:362) at com.sap.spark.vora.VoraRDD.compute(voraRDD.scala:54) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:313) at org.apache.spark.rdd.RDD.iterator(RDD.scala:277) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at
Q3。我在 Vora Modeler 中做错了吗?或者它实际上是一个错误?