aws-glue - AWS Glue copies the full data from source to target every time, even with bookmarks enabled

Question

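I have a Glue job generated by the wizard in the AWS Glue console. I did not change the default script when generating the job. It reads data from a Postgres database table (the source) and writes to another Postgres database (the target). I selected the option to enable bookmarks in the IDE. Every time the job runs, it copies the complete source table to the target table, even when nothing has been inserted, updated, or deleted in the source. My understanding is that with bookmarks enabled it should copy only the changes made in the source since the last run, but that is not happening. So if the source table has 4 rows, every run appends all 4 rows to the target again and the target's row count keeps growing. How can I make it process only the source changes since the last run? Further, how does it apply bookmarks? If a row is modified between two runs (with an SQL UPDATE statement), how will it "update" only the correct row?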

score 3 · Accepted Answer

Bookmarks only work when copying data between two S3 endpoints. JDBC/ODBC is not supported.

Answered 2017-12-18T20:24:37.233

score 0 · Accepted Answer

Bookmarks in AWS Glue jobs do support JDBC connections. As stated in the documentation, some prerequisites must be met when using them with JDBC sources.

For JDBC sources, the following rules apply:

 - For each table, AWS Glue uses one or more columns as bookmark keys to determine new and processed data. The bookmark keys combine to form a single compound key.
 - You can specify the columns to use as bookmark keys. If you don't specify bookmark keys, AWS Glue by default uses the primary key as the bookmark key, provided that it is sequentially increasing or decreasing (with no gaps).
 - If user-defined bookmarks keys are used, they must be strictly monotonically increasing or decreasing. Gaps are permitted.

This means you can use a column such as updated_at in the source table as the bookmark key. It will be monotonically increasing.
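The idea behind a bookmark key can be illustrated with a small, self-contained simulation. This is only a conceptual model and not how AWS Glue is implemented internally: the helper `read_new_rows` and the sample rows are hypothetical, and the sketch assumes the bookmark is simply the largest updated_at value seen so far.

```python
from datetime import datetime


def read_new_rows(rows, bookmark, key="updated_at"):
    """Return the rows whose key is strictly greater than the bookmark,
    together with the new bookmark value (hypothetical helper)."""
    fresh = [r for r in rows if bookmark is None or r[key] > bookmark]
    # Advance the bookmark to the largest key just read; keep it if nothing was read.
    new_bookmark = max((r[key] for r in fresh), default=bookmark)
    return fresh, new_bookmark


# Made-up source table with an updated_at column acting as the bookmark key.
source = [
    {"id": 1, "updated_at": datetime(2021, 1, 1)},
    {"id": 2, "updated_at": datetime(2021, 1, 2)},
]

# First run: there is no bookmark yet, so every row is read.
batch1, bookmark = read_new_rows(source, None)

# Second run with no changes in the source: nothing is read.
batch2, bookmark = read_new_rows(source, bookmark)

# A row is updated and its updated_at advances: only that row is read.
source[0]["updated_at"] = datetime(2021, 1, 3)
batch3, bookmark = read_new_rows(source, bookmark)
```

In this model an updated row is noticed only because its updated_at value moved past the stored bookmark. A key that never changes on update, such as a plain auto-increment id, would not surface the modification.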

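There is one more point that the documentation does not state explicitly, but that every example given by AWS follows, and that currently holds when using Glue job bookmarks.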

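If you want to use bookmarks, always use the from_catalog method for the data source. This means the schema should already exist in Glue, created either by a crawler or manually.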

For JDBC databases, you must first create a connection and then use a Glue crawler to create the table (JDBC tables cannot be created manually yet).

If you ingest data with the from_options method, Glue bookmarks will unfortunately not work. I learned this with an S3 data source.
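A minimal sketch of a from_catalog read with bookmark keys might look like the following. The wrapper `read_with_bookmark` and the database and table names in the comments are hypothetical. The `jobBookmarkKeys` and `jobBookmarkKeysSortOrder` options and the `transformation_ctx` argument follow the AWS Glue job bookmark documentation. The sketch assumes the job runs with bookmarks enabled and that the table was already created in the Data Catalog by a crawler.

```python
def read_with_bookmark(glue_context, database, table_name, bookmark_keys):
    """Read a Data Catalog table in a way that job bookmarks can track.

    glue_context is expected to be an awsglue GlueContext; it is passed in
    so that this helper (hypothetical) has no import-time dependency.
    """
    return glue_context.create_dynamic_frame.from_catalog(
        database=database,
        table_name=table_name,
        # Bookmark state is stored per transformation_ctx, so keep it stable
        # across runs of the same job.
        transformation_ctx=f"read_{table_name}",
        additional_options={
            "jobBookmarkKeys": bookmark_keys,
            "jobBookmarkKeysSortOrder": "asc",
        },
    )


# Inside a Glue job script (needs the awsglue runtime), usage would look like:
#   job.init(args["JOB_NAME"], args)
#   frame = read_with_bookmark(glueContext, "my_database", "my_table", ["updated_at"])
#   ... write frame to the target ...
#   job.commit()  # without this call the bookmark state is not saved
```

The job also has to be started with the `--job-bookmark-option job-bookmark-enable` argument, which is what the bookmark setting in the console controls.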

score -1 · Accepted Answer

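I recently published a blog post on building and automating a serverless data lake using an AWS Glue trigger for the Data Catalog and ETL jobs. You can find all the code in the CloudFormation template and p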

https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
