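Suppose data has been loaded into fact and dimension tables, and later analysis shows that 100 million records were loaded incorrectly. What steps do I need to take to clean up the data correctly?
1 Answer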
Here are two practices which help in that scenario:
1. Take a backup or snapshot before each batch. In the case of a major error like this, you can roll back to the snapshot, then reload and process the correct data.
2. Maintain an insert-only persistent staging area in the DW, such as a data vault, with each row stamped with a batch ID and timestamp. Remove the rows in error, and rebuild your facts and dimensions.
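If this has already happened and you have no persistent staging area to rebuild from, your only chance is the first practice: restoring from a backup or snapshot.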
If you don't have a reliable backup, and you have updated and/or deleted rows during the ETL/ELT process, you have no record of the pre-failure state, and it may be impossible to go back.
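To make the second practice concrete, here is a minimal sketch in Python using the standard-library sqlite3 module as a stand-in for a real warehouse. The table names (stg_sales, fact_sales), the columns, and the helper rebuild_fact_without_batch are all hypothetical and chosen only for illustration; a real DW would need its own rebuild logic for each fact and dimension.

```python
import sqlite3


def rebuild_fact_without_batch(conn, bad_batch_id):
    """Remove one bad batch from staging, then rebuild the fact table.

    Hypothetical helper: assumes the fact table can be fully derived
    from the staging table, which is the point of keeping staging around.
    """
    with conn:  # single transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM stg_sales WHERE batch_id = ?", (bad_batch_id,))
        conn.execute("DELETE FROM fact_sales")
        conn.execute(
            "INSERT INTO fact_sales (product_id, total_amount) "
            "SELECT product_id, SUM(amount) FROM stg_sales GROUP BY product_id"
        )


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Staging rows are stamped with the batch that loaded them.
    CREATE TABLE stg_sales (
        batch_id   INTEGER NOT NULL,
        loaded_at  TEXT    NOT NULL,
        product_id INTEGER NOT NULL,
        amount     REAL    NOT NULL
    );
    CREATE TABLE fact_sales (
        product_id   INTEGER PRIMARY KEY,
        total_amount REAL NOT NULL
    );
    """
)
conn.executemany(
    "INSERT INTO stg_sales VALUES (?, ?, ?, ?)",
    [
        (1, "2024-01-01T00:00:00", 10, 5.0),
        (1, "2024-01-01T00:00:00", 11, 7.0),
        (2, "2024-01-02T00:00:00", 10, 999.0),  # the batch loaded in error
        (3, "2024-01-03T00:00:00", 10, 2.0),
    ],
)
conn.commit()

# Drop batch 2 from staging and rebuild the fact table from the rest.
rebuild_fact_without_batch(conn, bad_batch_id=2)
rows = conn.execute(
    "SELECT product_id, total_amount FROM fact_sales ORDER BY product_id"
).fetchall()
print(rows)
```

Because every staging row carries its batch ID, the bad load can be isolated with a single predicate, and wrapping the delete and the rebuild in one transaction means a failure partway through leaves the previous state in place.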