python-3.x - How to configure the EMR cluster option "Use AWS Glue Data Catalog for table metadata" through the boto library?

Question

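I am trying to create an EMR cluster from an AWS Lambda function written with the Python boto library. I can create the cluster, but I also want to enable "Use AWS Glue Data Catalog for table metadata" so that Spark can read directly from the Glue Data Catalog. When I create an EMR cluster through the AWS console, I usually tick a checkbox ("Use AWS Glue Data Catalog for table metadata"), and that serves my purpose. I do not know how to achieve the same thing through the boto library.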

Below is the Python code I use to create the EMR cluster.

    import logging

    import boto3

    logger = logging.getLogger(__name__)

    try:
        connection = boto3.client(
            'emr',
            region_name='xxx'
        )
        cluster_id = connection.run_job_flow(
            Name='EMR-LogProcessing',
            LogUri='s3://somepath/',
            ReleaseLabel='emr-5.21.0',
            Applications=[
                {
                    'Name': 'Spark'
                },
            ],
            Instances={
                'InstanceGroups': [
                    {
                        'Name': "MasterNode",
                        'Market': 'SPOT',
                        'InstanceRole': 'MASTER',
                        'BidPrice': 'xxx',
                        'InstanceType': 'm3.xlarge',
                        'InstanceCount': 1,
                    },
                    {
                        'Name': "SlaveNode",
                        'Market': 'SPOT',
                        'InstanceRole': 'CORE',
                        'BidPrice': 'xxx',
                        'InstanceType': 'm3.xlarge',
                        'InstanceCount': 2,
                    }
                ],
                'Ec2KeyName': 'xxx',
                'KeepJobFlowAliveWhenNoSteps': True,
                'TerminationProtected': False
            },
            VisibleToAllUsers=True,
            JobFlowRole='EMR_EC2_DefaultRole',
            ServiceRole='EMR_DefaultRole',
            Tags=[
                {
                    'Key': 'Name',
                    'Value': 'EMR-LogProcessing',
                },
                {
                    'Key': 'env',
                    'Value': 'dev',
                },
            ],
        )

        print('cluster created with the step...', cluster_id['JobFlowId'])
    except Exception as exp:
        # Log at error level so a failed cluster creation is not mistaken for normal output
        logger.error("Exception occurred in createEMRcluster: %s", exp)

I have not found any clue on how to do this. Please help.

score 3 · Accepted Answer

Use the hive-site configuration classification to specify the value of hive.metastore.client.factory.class:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]

The snippet above can be passed to boto's run_job_flow function through its Configurations parameter.
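As a minimal sketch of how this could fit into the question's code: the helper names `glue_catalog_configurations` and `create_cluster` are hypothetical, and the `Instances` block is simplified to on-demand nodes for brevity (the question used spot instance groups). The AWS EMR documentation also describes a `spark-hive-site` classification with the same property for Spark SQL; passing it here is an assumption about what a Spark-only cluster needs.

```python
# Factory class named in the answer above.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def glue_catalog_configurations(classifications=("hive-site",)):
    """Build the Configurations list, one entry per classification (hypothetical helper)."""
    return [
        {
            "Classification": name,
            "Properties": {"hive.metastore.client.factory.class": GLUE_FACTORY},
        }
        for name in classifications
    ]


def create_cluster(emr_client, classifications=("hive-site",)):
    """Start a cluster with the Glue Data Catalog settings and return its id.

    emr_client is expected to be boto3.client('emr', region_name=...).
    """
    response = emr_client.run_job_flow(
        Name="EMR-LogProcessing",
        LogUri="s3://somepath/",
        ReleaseLabel="emr-5.21.0",
        Applications=[{"Name": "Spark"}],
        # The only addition relative to the question's call:
        Configurations=glue_catalog_configurations(classifications),
        Instances={
            # Simplified instance layout (assumption, not the asker's spot setup)
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
            "TerminationProtected": False,
        },
        VisibleToAllUsers=True,
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )
    return response["JobFlowId"]


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   client = boto3.client("emr", region_name="xxx")
#   create_cluster(client, classifications=("hive-site", "spark-hive-site"))
```

Taking the client as an argument keeps the request-building logic separate from AWS access, so the function can be exercised with a stub client before it runs inside Lambda.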

Reference: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html
