We are currently working on loading data into Redshift, and we have several scenarios to handle. If the OLTP database is SQL Server residing on premises, we can consider a tool like Attunity, which can load data into Redshift via S3. Attunity is smart at CDC: it identifies changes by reading the transaction log and applies them to the target accordingly. However, this kind of tool is poor at applying transformation logic during the ETL process. Attunity is not a replacement for SSIS or ODI; it is good at extracting and loading data from various sources. So for the transformations we need a proper ETL tool. We can use Attunity to load data into a staging area inside Redshift, and from the staging area load data into the target tables using either another ETL tool or triggers. Since triggers are not supported in Redshift, what could that ETL tool be? We have not found anything other than AWS Data Pipeline, but using two tools, Attunity and AWS Data Pipeline, might get costly. Is there an alternative? We don't think Data Pipeline can connect to an on-premises SQL Server; it only works within the Amazon ecosystem.
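Since Redshift has no triggers, whatever tool ends up orchestrating the staging-to-target step would simply run SQL on a schedule. As a minimal sketch, assuming hypothetical `staging.orders` and `target.orders` tables keyed on `order_id`, the common Redshift delete-and-insert merge pattern could be driven from Python with psycopg2:

```python
# Minimal sketch of the staging-to-target merge pattern in Redshift.
# Redshift has no triggers, so a scheduler (AWS Data Pipeline, cron, etc.)
# would run this after Attunity lands changed rows in staging.
# Table and column names are hypothetical placeholders.
import psycopg2

MERGE_SQL = """
-- Remove target rows that have a newer version in staging
DELETE FROM target.orders
USING staging.orders
WHERE target.orders.order_id = staging.orders.order_id;

-- Insert the full set of changed rows from staging
INSERT INTO target.orders
SELECT * FROM staging.orders;

-- Clear staging for the next CDC batch (DELETE rather than TRUNCATE,
-- because TRUNCATE in Redshift commits the transaction immediately)
DELETE FROM staging.orders;
"""

def run_merge(dsn: str) -> None:
    """Execute the merge as one transaction; the context manager commits."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(MERGE_SQL)

if __name__ == "__main__":
    run_merge("host=my-cluster.redshift.amazonaws.com port=5439 "
              "dbname=dw user=etl password=...")
```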

Now suppose our on-premises SQL Server is instead deployed on Amazon RDS. The situation changes. We could still follow the same ETL process described above with the two tools, Attunity and AWS Data Pipeline, but this time it should be easier to use a single tool: AWS Data Pipeline. Is AWS Data Pipeline capable enough to handle all scenarios? We don't find that it can read the transaction log, but we should be able to apply other approaches for incremental loads. A very common approach is to keep a last-modified-date column on each source table; we can then identify the rows in the RDS SQL Server tables that have been modified since the last load. However, we cannot move the changed data from RDS to Redshift directly; we have to go through either S3 or DynamoDB. We can make AWS Data Pipeline use S3 as the route, but that again seems like a headache, and maybe there is some easier approach. Also, AWS Data Pipeline is quite new in a competitive market, and a very big limitation is its inability to load data from sources outside AWS (say Salesforce, Oracle, etc.). Is there another easy-to-use tool that works flawlessly inside the AWS ecosystem with minimal difficulty and cost?
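To make the last-modified-date approach concrete, here is a hedged sketch of the RDS-to-S3-to-Redshift route using pyodbc, boto3, and psycopg2. The table, column, bucket, and IAM role names are all hypothetical placeholders, and the watermark would need to be persisted between runs:

```python
# Hedged sketch of the "last modified date" incremental load described above:
# pull changed rows from RDS SQL Server, stage them in S3 as CSV, then COPY
# them into a Redshift staging table. Names below are hypothetical.
import csv
import datetime
import io

import boto3      # AWS SDK for Python
import pyodbc     # SQL Server driver
import psycopg2   # Redshift speaks the PostgreSQL wire protocol

def extract_changed_rows(sql_server_conn_str: str,
                         watermark: datetime.datetime) -> str:
    """Return rows modified since the last load, as CSV text."""
    conn = pyodbc.connect(sql_server_conn_str)
    cur = conn.cursor()
    cur.execute(
        "SELECT order_id, status, amount, last_modified "
        "FROM dbo.orders WHERE last_modified > ?", watermark)
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in cur:
        writer.writerow(row)
    conn.close()
    return buf.getvalue()

def load_to_redshift(redshift_dsn: str, bucket: str, key: str,
                     csv_body: str) -> None:
    """Stage the CSV in S3, then COPY it into the Redshift staging table."""
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=csv_body)
    copy_sql = (
        f"COPY staging.orders FROM 's3://{bucket}/{key}' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy' "  # hypothetical
        "FORMAT AS CSV;"
    )
    with psycopg2.connect(redshift_dsn) as conn:
        conn.cursor().execute(copy_sql)
```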

2 Answers

I would rely on Attunity to bring your OLTP data into the staging area, because it is very good at managing that part of the pipeline (although you will have to build a lot of your own monitoring using repctl) and it can be a very cost-effective solution for a part of the ETL that has traditionally been very expensive to build. Pentaho DI is a good choice as the ETL tool to run the programmatic components of the ETL process, because you can build (although it has some built in) "triggers" to watch database tables, file systems, FTP sites, queues, etc., and have them run almost any process you want. There is a decent Community Edition that includes most of the functionality, and buying the EE version is worth it for the support and the scheduler.
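Pentaho DI jobs are normally assembled in its GUI rather than written as code, but the polling "trigger" idea this answer describes is easy to illustrate. Below is a minimal, hypothetical Python sketch (the table name, DSN, and job script are placeholders) of watching a staging table and firing a downstream job when new rows land:

```python
# Illustrative sketch of a polling "trigger": watch a staging table's row
# count and kick off a downstream job when new rows arrive. In Pentaho DI
# you would build this from its job steps and launch .kjb jobs with the
# kitchen.sh runner; the table name and script here are hypothetical.
import subprocess
import time

import psycopg2

def poll_and_run(dsn: str, interval_sec: int = 60) -> None:
    last_count = 0
    while True:
        with psycopg2.connect(dsn) as conn:
            cur = conn.cursor()
            cur.execute("SELECT COUNT(*) FROM staging.orders;")
            count = cur.fetchone()[0]
        if count > last_count:
            # New rows landed: run the transformation job
            subprocess.run(["./run_transform_job.sh"], check=True)
            last_count = count
        time.sleep(interval_sec)
```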

Answered 2015-09-09T22:07:09.700

"AWS Data Pipeline might get costly": this Amazon service is free.

You can use Amazon Workflow Service to schedule the steps of your ETL transformations.
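As a hedged sketch of that suggestion: with Amazon SWF you would still write a decider and activity workers that poll for tasks and run each ETL step, but kicking off a coordinated run from boto3 looks roughly like this (the domain, workflow type, and task list names are hypothetical and would have to be registered beforehand):

```python
# Hedged sketch: start an ETL run coordinated by Amazon SWF.
# SWF only orchestrates; separate decider and activity workers
# would poll for tasks and execute each ETL step.
import json

import boto3

swf = boto3.client("swf", region_name="us-east-1")

response = swf.start_workflow_execution(
    domain="etl-domain",                   # hypothetical, registered beforehand
    workflowId="nightly-load-2017-01-18",  # must be unique among open executions
    workflowType={"name": "RedshiftLoad", "version": "1.0"},
    taskList={"name": "etl-task-list"},
    input=json.dumps({"source": "rds-sqlserver", "target": "redshift"}),
    executionStartToCloseTimeout="3600",   # seconds, passed as a string
    taskStartToCloseTimeout="600",
)
print("Started run:", response["runId"])
```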

Answered 2017-01-18T14:17:49.473