azure - Best practice for inferring a schema from CSV files in the raw ingestion layer of a data lake?

Question

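Is there a best practice for inferring a schema in the raw ingestion layer of a data lake (not schema validation, just inferring data types and column names)?

I am using Azure and want to design a way to validate schemas downstream of the ingestion layer, so I need a way to infer the schema from a CSV to validate against.

So far I have tried using Azure Data Factory to read a CSV containing integers and write it to AVRO, but because the schema is taken from the header, everything was stored as strings. I also tried scanning the files (CSV and AVRO) with Purview, but it still returns all strings.

File format: NAICS company number, NAICS company name, and one column per state (with a value of 1 or 0)

I think the obvious answer may be to use Spark (Databricks), but I want to make sure I have a simple/necessary reason for recommending it.

Edit: We need to do this dynamically, because the pipeline will run daily and will ingest many CSVs (not just one file).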

score 0 · Accepted Answer

I'm not sure I understood correctly, but you can get something like this. It gives you a StructType that you can use to validate your files.

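val df = spark.read.format("csv")
     .option("header","true")        // first row supplies the column names
     .option("inferSchema","true")   // scan the data to infer column types
     .load("/FileStore/tables/retail-data/by-day/2010_12_01.csv")

// StructType describing the inferred column names and types
val scheme = df.schema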

Result: scheme: org.apache.spark.sql.types.StructType = StructType(StructField(InvoiceNo,StringType,true), StructField(StockCode,StringType,true), StructField(Description,StringType,true), StructField(Quantity,IntegerType,true), StructField(InvoiceDate,StringType,true), StructField(UnitPrice,DoubleType,true), StructField(CustomerID,DoubleType,true), StructField(Country,StringType,true))
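
One possible way to use such an inferred schema for validation is to compare it field by field against a schema you expect. This is only a minimal sketch: the `schemaDiff` helper and the two small example schemas are hypothetical and not part of the answer above. It compares column names and data types and ignores nullability, since CSV inference reports every column as nullable.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: describe how an inferred schema differs from an expected one.
// Only column names and data types are compared; nullability is ignored.
def schemaDiff(expected: StructType, actual: StructType): Seq[String] = {
  val actualByName = actual.fields.map(f => f.name -> f.dataType).toMap
  val expectedNames = expected.fieldNames.toSet

  // Columns that are expected but absent, or present with another type.
  val missingOrMismatched = expected.fields.toSeq.flatMap { f =>
    actualByName.get(f.name) match {
      case None => Some(s"missing column: ${f.name}")
      case Some(dt) if dt != f.dataType =>
        Some(s"type mismatch for ${f.name}: expected ${f.dataType}, got $dt")
      case _ => None
    }
  }

  // Columns that appear in the file but not in the expectation.
  val unexpected = actual.fieldNames.toSeq
    .filterNot(expectedNames.contains)
    .map(n => s"unexpected column: $n")

  missingOrMismatched ++ unexpected
}

// Made-up schemas for illustration only.
val expected = StructType(Seq(
  StructField("Quantity", IntegerType),
  StructField("UnitPrice", DoubleType)))
val inferred = StructType(Seq(
  StructField("Quantity", StringType),
  StructField("Country", StringType)))

// One entry per problem found; an empty result means names and types match.
val problems = schemaDiff(expected, inferred)
```

In a daily pipeline the same call could be applied to `df.schema` for each incoming file, with the expected schema kept alongside the pipeline configuration.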
