14

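Every job script should end with job.commit(), but what exactly does this function do?

  1. Is it just an end-of-job marker?
  2. Can it be called twice within one job (and if so, in what cases)?
  3. Is it safe to execute any Python statements after job.commit() has been called?

P.S. I have not found any description in PyGlue.zip with the AWS Python source code :(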

4

3 Answers

17

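As of today, the only case where the Job object is useful is when you use Job Bookmarks. When you read files from Amazon S3 (the only source supported by bookmarks so far) and call job.commit(), the time and the paths read so far are stored internally, so if for some reason you try to read that path again, you only get back the unread (new) files.

In this code sample, I read and process two different paths separately, and commit after each path is processed. If for some reason I stop the job, the same files will not be processed again.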

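import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on them and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        # do_something is a placeholder for your own processing
        do_something(dynamic_frame)
        # Commit file read to Job Bookmark
        job.commit()
    except Exception as exc:
        # Something failed: report the path and re-raise
        print("Failed to process {}: {}".format(path, exc))
        raise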

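Calling the commit method on a Job object only has an effect if you have Job Bookmarks enabled, and the stored references are kept from JobRun to JobRun until you reset or pause the Job Bookmark. It is completely safe to execute more Python statements after a job.commit(), and as the previous code sample shows, committing multiple times is also valid.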

Hope this helps.

Answered 2018-01-16T13:48:39.593
4

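Expanding on @yspotts' answer. It is possible to execute more than one job.commit() in an AWS Glue job script, although, as they mentioned, the bookmark will be updated only once. However, it is also safe to call job.init() more than once. In that case, the bookmark is updated correctly with the S3 files processed since the previous commit.

In the init() function, there is an "initialised" marker that gets set to true. The commit() function then checks this marker: if it is true, it performs the steps to commit the bookmark and resets the "initialised" marker. If it is false, it does nothing.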
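The marker logic described above could be modelled roughly as follows. This is only a toy sketch in plain Python, not the awsglue source: the class name BookmarkJobModel, the read() method, and the pending and bookmarked lists are all hypothetical and exist purely to illustrate the described behaviour.

```python
class BookmarkJobModel:
    """Toy stand-in for the flag logic described above; NOT the awsglue Job class."""

    def __init__(self):
        self.initialised = False  # the "initialised" marker
        self.pending = []         # files read since the last commit (hypothetical)
        self.bookmarked = []      # files recorded in the bookmark (hypothetical)

    def init(self):
        # init() sets the marker to true
        self.initialised = True

    def read(self, path):
        # Remember a file that has been read but not yet committed
        self.pending.append(path)

    def commit(self):
        if not self.initialised:
            return False  # marker is false: the call does nothing
        self.bookmarked.extend(self.pending)
        self.pending.clear()
        self.initialised = False  # reset the marker after committing
        return True


job = BookmarkJobModel()
job.init()
job.read("apples.json")
first = job.commit()    # marker was true, so the bookmark is updated
job.read("oranges.json")
second = job.commit()   # marker was already reset, so this is a no-op
job.init()
third = job.commit()    # marker is true again, so the pending file is recorded
```

In this model the second commit is ignored because nothing set the marker again, which is why calling init() before each commit makes every commit count.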

So the only change needed to @hoaxz's answer is to call job.init() in every iteration of the for loop:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Init my job
job = Job(glue_context)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on them and commit
for s3_path in paths:
    # Re-initialise the job so that the following commit updates the bookmark
    job.init(args['JOB_NAME'], args)
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [s3_path]},
        format='json',
        transformation_ctx="path={}".format(s3_path))
    # do_something is a placeholder for your own processing
    do_something(dynamic_frame)
    # Commit file read to Job Bookmark
    job.commit()

Answered 2018-09-24T11:33:32.583
2

According to the AWS support team, commit should not be called more than once. Here is the exact response I got from them:

The method job.commit() can be called multiple times and it would not throw any error 
as well. However, if job.commit() would be called multiple times in a Glue script 
then job bookmark will be updated only once in a single job run that would be after 
the first time when job.commit() gets called and the other calls for job.commit() 
would be ignored by the bookmark. Hence, job bookmark may get stuck in a loop and 
would not able to work well with multiple job.commit(). Thus, I would recommend you 
to use job.commit() once in the Glue script.
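
One way to follow that recommendation is to process every path first and commit a single time at the end. The sketch below is an assumption about how that could be structured, not code from AWS: run_with_single_commit is a hypothetical helper, and the read, process, and commit callables are stand-ins so the example runs without the awsglue library. In a real Glue script, commit would be job.commit and read would wrap glue_context.create_dynamic_frame_from_options.

```python
def run_with_single_commit(paths, read, process, commit):
    """Read and process every path, then commit exactly once at the end."""
    processed = []
    for path in paths:
        process(read(path))
        processed.append(path)
    commit()  # in a Glue script this would be job.commit()
    return processed


# Stand-in callables so the sketch runs without the awsglue library.
commit_calls = []
done = run_with_single_commit(
    ["s3://bucket-name/my_partition=apples/",
     "s3://bucket-name/my_partition=oranges/"],
    read=lambda path: {"source": path},
    process=lambda frame: None,
    commit=lambda: commit_calls.append("commit"),
)
```

The trade-off is that if the run fails part-way through, nothing has been committed, so the next run starts again from the first path.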
Answered 2018-07-26T13:59:25.353