
Right now I have one test dataset that has 1 partition, and inside that partition there are 2 parquet files.

If I read the data as:

val df = spark.read.format("delta").load("./test1510/table@v1")

Then I get the latest data with 10,000 rows, and if I read:

val df = spark.read.format("delta").load("./test1510/table@v0")

Then I get 612 rows. My question is: how can I view only the new rows that were added in version 1, which is 10,000 - 612 = 9,388 rows?

In short, at each version I just want to view which data changed. In the delta log I am able to see JSON files, and inside those JSON files I can see that a separate parquet file is created at each version, but how can I view this in code?

I am using Spark with Scala.
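
For context, this is roughly what I mean by inspecting the log (a minimal sketch that reads the commit JSON for version 1; the commit for version N is the zero-padded N.json file under _delta_log, and the add action carries the path of each parquet file written in that commit):

// Sketch: list the parquet files added by the version 1 commit.
val commit = spark.read.json("./test1510/table/_delta_log/00000000000000000001.json")

commit
  .where("add IS NOT NULL")                        // keep only the "add file" actions
  .select("add.path", "add.size", "add.dataChange")
  .show(false)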


1 Answer


You don't even need to go down to the parquet file level. You can achieve this with a simple SQL query.

%sql
SELECT * FROM test_delta VERSION AS OF 2
MINUS
SELECT * FROM test_delta VERSION AS OF 1

The query above will give you the rows that were newly added in version 2 and are not present in version 1.
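
The same query can also be issued from Scala via spark.sql, as in the sketch below; it assumes the Delta table is registered under the name test_delta and that the runtime supports the VERSION AS OF SQL syntax (Databricks, or a recent open-source Delta Lake release).

// Sketch: the same version diff from Scala, assuming a registered table named
// test_delta and SQL time-travel support in the runtime.
val addedInV2 = spark.sql(
  """SELECT * FROM test_delta VERSION AS OF 2
    |MINUS
    |SELECT * FROM test_delta VERSION AS OF 1""".stripMargin)

addedInV2.show(false)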

In your case, you can do the following:

val df1 = spark.read.format("delta").load("./test1510/table@v1") // version 1
val df0 = spark.read.format("delta").load("./test1510/table@v0") // version 0
// rows added in version 1 = rows present in version 1 but not in version 0
display(df1.except(df0)) // use df1.except(df0).show() outside a Databricks notebook
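
As an alternative to the @vN path suffix, roughly the same result can be obtained with the versionAsOf read option; the sketch below reuses the paths from the question and counts the added rows (expected to be 10,000 - 612 = 9,388 if version 1 only appended data).

// Sketch: time travel via the versionAsOf reader option instead of the @vN suffix.
val v1 = spark.read.format("delta").option("versionAsOf", 1).load("./test1510/table")
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("./test1510/table")

val added = v1.except(v0) // rows in version 1 that are not in version 0
added.show(false)
println(added.count())    // expected ~9,388 if version 1 only appended rows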
Answered 2020-02-03T17:08:56.163