apache-spark - 使用 Deequ 对 Apache Spark 应用程序进行单元测试

问问题 2021-12-06T15:08:59.630

27 次

我即将商业化/生产我的 ML 管道，它会进行一些数据清理和转换。我基本上拥有以下内容，按顺序运行：

    1. Load the CSV data from a remote location
    
    2. If the CSV fails to parse into a schema, fail the pipeline
    
    3. Once Step 2 is successful, remove all rows that are duplicates
    
    4. Impute all NaN values
    
    5. OneHotEncode categorical data

    6. Identify some variance in the columns and apply a threshold to throw away those columns that does not vary that much

我想实施几个额外的步骤。现在我的问题是如何在每个步骤之间进行单元测试。使用传统应用程序，您可以针对代码库运行单元测试，并在成功后打包它们，但在这里我必须首先将它们打包为 Spark 应用程序，但我想在我的每一步之间暂停，我的每一步上面将给我一个新的 DataFrame，我想在每个步骤中使用 Deequ 来检查结果 DataFrame 的期望。

关于我的方法应该是什么的任何想法？

apache-spark - 使用 Deequ 对 Apache Spark 应用程序进行单元测试

0 回答 0

Related

Reference