I have a typical AWS Glue-generated script that loads data from an S3 bucket into my Aurora database, which is reachable through a JDBC connection. For reference, it looks like this:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev-db", 
    table_name = "attributes", transformation_ctx = "datasource0")

applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = 
     [("id", "long", "id", "long"), ("value", "string", "value", "string"), 
     ("name", "string", "name", "string")], transformation_ctx = "applymapping1")

resolvechoice2 = ResolveChoice.apply(frame = applymapping1, 
     choice = "make_cols", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, 
     transformation_ctx = "dropnullfields3")

datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3, 
     catalog_connection = "local-dev-aurora", 
     connection_options = {"dbtable": "attributes", "database": "local-dev-db"}, 
     transformation_ctx = "datasink4")

job.commit()

The script above creates the table in the database in question and loads the CSV data from the bucket into it. The imported data is very large, and I then need to add the usual index to the RDS database table.

How can I specify that the id from the mapping (or, alternatively, a combination of fields) should be an index? Can I do it using the Python Glue functions, or is it necessary to connect to the database after job.commit() and add the indexes separately?

1 Answer

Adding an index is a SQL query operation; the Glue dynamic frame won't do anything about it.

So, once the data has been imported, run the CREATE INDEX query from Glue itself.
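
For example, here is a minimal sketch of that approach, assuming an Aurora MySQL target and that the pymysql driver is available to the job (e.g. via the --additional-python-modules job parameter). The index names are made up, and the exact keys in the dict returned by extract_jdbc_conf should be verified for your Glue version:

import pymysql

# Reuse the credentials stored in the Glue catalog connection
jdbc_conf = glueContext.extract_jdbc_conf("local-dev-aurora")

# jdbc_conf["url"] is a JDBC URL such as "jdbc:mysql://host:3306";
# pull the plain hostname out of it
host = jdbc_conf["url"].split("//")[1].split(":")[0]

connection = pymysql.connect(host = host,
    user = jdbc_conf["user"],
    password = jdbc_conf["password"],
    database = "local-dev-db")
try:
    with connection.cursor() as cursor:
        # Single-column index on id
        cursor.execute("CREATE INDEX idx_attributes_id ON attributes (id)")
        # Or a composite index over several mapped fields:
        # cursor.execute("CREATE INDEX idx_attributes_name_value ON attributes (name, value)")
    connection.commit()
finally:
    connection.close()

Running this after job.commit() also keeps the bulk load fast, since the index is built only once, after all the data is in place.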

answered 2020-11-25T02:18:13.873