0

我正在使用带有 Azure Databricks Delta 的 Spark Structured Streaming,我在其中写入 Delta 表(增量表名称是原始的)。我正在从 Azure 文件中读取我收到乱序数据的地方,并且其中有 2 列“ smtUidNr”和“ msgTs“。我正在尝试通过在我的代码中使用 Upsert 来处理重复项,但是当我查询我的增量表“ raw”时。我在增量表中看到以下重复记录

    smtUidNr                                 msgTs
    57A94ADA218547DC8AE2F3E7FB14339D    2019-08-26T08:58:46.000+0000
    57A94ADA218547DC8AE2F3E7FB14339D    2019-08-26T08:58:46.000+0000
    57A94ADA218547DC8AE2F3E7FB14339D    2019-08-26T08:58:46.000+0000

以下是我的代码:

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.functions._


// merge duplicates
def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {


  microBatchOutputDF.createOrReplaceTempView("updates")


  microBatchOutputDF.sparkSession.sql(s"""
    MERGE INTO raw t
    USING updates s
    ON (s.smtUidNr = t.smtUidNr and s.msgTs>t.msgTs) 
    WHEN MATCHED THEN UPDATE SET * 
    WHEN NOT MATCHED THEN INSERT *
  """)
}


val df=spark.readStream.format("delta").load("abfss://abc@hjklinfo.dfs.core.windows.net/entrypacket/")
df.createOrReplaceTempView("table1")
val entrypacket_DF=spark.sql("""SELECT details as dcl,invdetails as inv,eventdetails as evt,smtdetails as smt,msgHdr.msgTs,msgHdr.msgInfSrcCd FROM table1 LATERAL VIEW explode(dcl) dcl AS details LATERAL VIEW explode(inv) inv AS invdetails LATERAL VIEW explode(evt) evt as eventdetails LATERAL VIEW explode(smt) smt as smtdetails""").dropDuplicates()


entrypacket_DF.createOrReplaceTempView("ucdx")

//Here, we are adding a column date_timestamp which converts msgTs timestamp to YYYYMMDD format in column date_timestamp which eliminates duplicate for today & then we drop this column meaning which we are not tampering with msgTs column
val resultDF=spark.sql("select dcl.smtUidNr,dcl,inv,evt,smt,cast(msgTs as timestamp)msgTs,msgInfSrcCd from ucdx").withColumn("date_timestamp",to_date(col("msgTs"))).dropDuplicates(Seq("smtUidNr","date_timestamp")).drop("date_timestamp")

resultDF.createOrReplaceTempView("final_tab")

val finalDF=spark.sql("select distinct smtUidNr,max(dcl) as dcl,max(inv) as inv,max(evt) as evt,max(smt) as smt,max(msgTs) as msgTs,max(msgInfSrcCd) as msgInfSrcCd from final_tab group by smtUidNr")


finalDF.writeStream.format("delta").foreachBatch(upsertToDelta _).outputMode("update").start()

结构化流不支持聚合、窗口函数和 order by 子句?我可以做些什么来修改我的代码,以便我只能拥有特定 smtUidNr 的 1 条记录?

4

2 回答 2

1

您需要做的是在foreachBatch方法中进行重复数据删除,因此您要确保每个批次合并只为每个键写入一个值。

在您的示例中,您将执行以下操作:

def upsertToDelta(microBatchOutputDF: DataFrame, batchId: Long) {

  microBatchOutputDF
    .select('smtUidNr, struct('msgTs, 'dcl, 'inv, 'evt, 'smt, 'msgInfSrcCd).as("cols"))
    .groupBy('smtUidNr)
    .agg(max('cols).as("latest"))
    .select("smtUidNr", "latest.*")
    .createOrReplaceTempView("updates")

  microBatchOutputDF.sparkSession.sql(s"""
    MERGE INTO raw t
    USING updates s
    ON (s.smtUidNr = t.smtUidNr and s.msgTs>t.msgTs) 
    WHEN MATCHED THEN UPDATE SET * 
    WHEN NOT MATCHED THEN INSERT *
  """)
}

finalDF.writeStream.foreachBatch(upsertToDelta _).outputMode("update").start()

您可以在文档中看到更多示例,请点击此处

于 2019-10-18T03:01:33.743 回答
0

如果存在多行相同的唯一 ID,则以下代码段可帮助您查找最新记录。如果多行完全相同,也只拾取一行。

让用于过滤行/记录的唯一键/键为“id”。您有一个“时间戳”列来查找相同 ID 的最新记录。

def upsertToDelta(micro_batch_df, batchId) :
   delta_table = DeltaTable.forName(spark, f'{database}.{table_name}')
   df = micro_batch_df.dropDuplicates(['id']) \
       .withColumn("r", rank().over(Window.partitionBy('id') \
       .orderBy(col('timestamp').desc()))).filter("r==1").drop("r")
   delta_table.alias("t") \
      .merge(df.alias("s"), 's.id = t.id') \
      .whenMatchedUpdateAll() \
      .whenNotMatchedInsertAll() \
      .execute()
final_df.writeStream \
  .foreachBatch(upsertToDelta) \
  .option('checkpointLocation', '/mnt/path/checkpoint') \
  .outputMode('update') \
  .start()
于 2021-04-02T04:34:45.623 回答