
I'm trying to read all (or multiple) datasets from a single directory in a single PySpark transform. Is it possible to iterate over all the datasets in a path without hard-coding individual datasets as inputs?

I want to dynamically pull different columns from multiple datasets without having to hard-code each input dataset.


1 Answer


This won't work: you would get inconsistent results every time CI runs. It would also break TLLV (transforms-level logic versioning) by making it impossible to tell when the logic has actually changed, and therefore when a dataset should be marked stale.

You will have to write out the logical path of each dataset you wish to transform, even if that means the paths are passed into a generated transform (see the sketch below). There must be at least some consistent, version-controlled record of which datasets were targeted by which commit.
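
For example, here is a minimal sketch of that generated-transform pattern using Foundry's transforms.api. The dataset paths and the `_transformed` output naming convention are hypothetical; only the overall shape (a checked-in list of paths driving a loop of transform definitions) is the point:

```python
from transforms.api import transform_df, Input, Output

# Explicit, checked-in list of dataset paths (hypothetical paths).
# The commit history of this list is the consistent record of which
# datasets were targeted by which commit.
DATASET_PATHS = [
    "/Project/folder/dataset_a",
    "/Project/folder/dataset_b",
]


def make_transform(input_path):
    @transform_df(
        Output(input_path + "_transformed"),  # hypothetical naming convention
        source=Input(input_path),
    )
    def compute(source):
        # Per-dataset logic goes here; this sketch just passes rows through.
        return source

    return compute


# Register these with your project's Pipeline object (e.g. in pipeline.py):
#   my_pipeline.add_transforms(*TRANSFORMS)
TRANSFORMS = [make_transform(path) for path in DATASET_PATHS]
```

Because DATASET_PATHS lives in the repository, adding or removing a dataset is itself a commit, so TLLV can still track exactly when logic changed.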

Another tactic to achieve what you're looking for is to maintain a single long dataset that is the unpivoted version of all the source datasets. That way you can simply APPEND new rows / files to this one dataset, which lets you accept arbitrary inputs, assuming your transform is constructed to handle this (a sketch of the unpivot follows). My rule of thumb is this: if you need dynamic schemas or dynamic counts of datasets, you're better off using dynamic row / file counts in a single dataset.
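
For illustration, a quick PySpark sketch of that unpivot; the `id`/`metric_*` column names are made up. The key property is that a new source only ever adds rows, so the long dataset's schema stays fixed:

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical wide input: each source dataset contributes its own columns.
wide = spark.createDataFrame([(1, 10.0, 20.0)], ["id", "metric_a", "metric_b"])

# Unpivot to long form: one row per (id, metric_name, metric_value).
# A new source just appends rows with new metric_name values;
# the schema of the long dataset never changes.
value_cols = [c for c in wide.columns if c != "id"]
stack_expr = "stack({n}, {pairs}) as (metric_name, metric_value)".format(
    n=len(value_cols),
    pairs=", ".join("'{c}', {c}".format(c=c) for c in value_cols),
)
long_df = wide.select("id", F.expr(stack_expr))
long_df.show()
# +---+-----------+------------+
# | id|metric_name|metric_value|
# +---+-----------+------------+
# |  1|   metric_a|        10.0|
# |  1|   metric_b|        20.0|
# +---+-----------+------------+
```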

Answered 2020-09-21T17:05:48.253