data-warehouse - Understanding ETL processes

Question

ETL seems to be a pretty common task. I have been reading about the mistakes ETL designers make with very large data at http://it.toolbox.com/blogs/infosphere/17-mistakes-that-etl-designers-make-with-very-large-data-19264

I need some practical insights on the following points:

a) Incorporating Inserts, Updates, and Deletes into the same data flow / same process. How is that a problem?

b) Sourcing multiple systems at the same time, depending on heterogeneous systems of data.

c) Not producing the correct indexes on the sources/lookups that need to be accessed.

d) Believing that "I need to process all the data in one pass because it's the fastest way to do it".

Any help?

score 3 · Accepted Answer

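a) Data integrity issues.

b) With smaller chunks, data quality improves and failures decrease.

c) It takes more time to complete.

d) Wrong indexes lead to longer run times. It is best to have indexes based on the queries you are executing, i.e. on what is in the WHERE clause of the statement.

e) Splitting the data into smaller data sets and processing them would be an effective solution.

Yours, a BITS-PILANI (WILP) student.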
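To make point d) concrete, here is a minimal sketch of indexing the column used in a lookup's WHERE clause. It uses Python's built-in sqlite3 as a stand-in for whatever database the lookup really lives in. The table `customer_lookup`, its columns, the index name, and the helper `query_plan` are all made up for illustration. The exact plan text differs between SQLite versions, so treat the printed output as indicative only.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the human-readable part of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row holds the plan description.
    return " | ".join(row[-1] for row in rows)

conn = sqlite3.connect(":memory:")

# Hypothetical lookup table an ETL job might use to resolve surrogate keys.
conn.execute(
    "CREATE TABLE customer_lookup (customer_key INTEGER, source_id TEXT, name TEXT)"
)
conn.executemany(
    "INSERT INTO customer_lookup VALUES (?, ?, ?)",
    [(i, f"SRC-{i}", f"name {i}") for i in range(10_000)],
)

# The lookup filters on source_id, so that is the column worth indexing.
lookup_sql = "SELECT customer_key FROM customer_lookup WHERE source_id = ?"

plan_before = query_plan(conn, lookup_sql, ("SRC-42",))

conn.execute("CREATE INDEX idx_lookup_source_id ON customer_lookup (source_id)")

plan_after = query_plan(conn, lookup_sql, ("SRC-42",))

print("before:", plan_before)  # a full scan of the table
print("after: ", plan_after)   # a search that names idx_lookup_source_id
```

Comparing the two plans is a cheap way to check that an index matches the query you actually run, rather than guessing which columns to index.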

score 1 · Accepted Answer

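A) It becomes a problem when you find the task taking too long to complete (because data volumes have grown) and by then it is technically too difficult to separate the operations. But separating the tasks increases the chance of an inconsistent data load (i.e. your DELETE works but your INSERT fails, which means you have lost data from the load).

B) I don't understand "at the same time" here. Do you mean concurrently? If you try to load data from multiple systems concurrently, you can max out your bandwidth (network, disk, etc.). If you need to load that data while the systems are offline, sometimes you have no choice.

C) Yes, incorrect indexes slow down access. But vendors usually don't like you creating indexes in the source database.

D) Performance tuning (finding the fastest way) is a complex topic. In some cases doing it in one pass may be faster. In other cases it may not.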
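The failure mode in point A (the DELETE works but the INSERT fails) can be sketched as follows. This assumes the delete and insert can share one database transaction, which is not always possible across separate ETL jobs. SQLite via Python's sqlite3 stands in for the warehouse; the `sales` table and the `reload_partition` function are hypothetical names for illustration.

```python
import sqlite3

def reload_partition(conn, load_date, amounts):
    """Replace one day's rows; DELETE and INSERT commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute("DELETE FROM sales WHERE load_date = ?", (load_date,))
            conn.executemany(
                "INSERT INTO sales (load_date, amount) VALUES (?, ?)",
                [(load_date, amount) for amount in amounts],
            )
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (load_date TEXT NOT NULL, amount REAL NOT NULL)")
conn.execute("INSERT INTO sales VALUES ('2024-01-01', 10.0)")
conn.commit()

# A bad batch: None violates NOT NULL, so the INSERT fails after the DELETE ran.
ok = reload_partition(conn, "2024-01-01", [5.0, None])
remaining = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]

# The rollback restores the deleted row, so nothing is lost.
print(ok, remaining)
```

If the two steps ran as separately committed tasks instead, the same failure would leave the table empty for that date, which is the lost-load situation described above.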
