I have a typical AWS Glue-generated script that loads data from an S3 bucket into my Aurora database, which is reachable through a JDBC connection. For reference, it looks like this:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "dev-db",
    table_name = "attributes", transformation_ctx = "datasource0")
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings =
    [("id", "long", "id", "long"), ("value", "string", "value", "string"),
     ("name", "string", "name", "string")], transformation_ctx = "applymapping1")
resolvechoice2 = ResolveChoice.apply(frame = applymapping1,
    choice = "make_cols", transformation_ctx = "resolvechoice2")
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2,
    transformation_ctx = "dropnullfields3")
datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = dropnullfields3,
    catalog_connection = "local-dev-aurora",
    connection_options = {"dbtable": "attributes", "database": "local-dev-db"},
    transformation_ctx = "datasink4")
job.commit()
The script above creates the table in the database in question and loads the CSV data from the bucket into it. The imported data is quite large, and I then need to add the usual index to the resulting RDS table.
How can I specify that the id column from the mapping (or, alternatively, a combination of fields) should be indexed? Can I do this with the Glue Python functions, or is it necessary to connect to the database after job.commit() and add the indexes separately?
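If Glue itself cannot do it, the fallback I have in mind is to open a separate connection right after job.commit() and run the DDL there. A minimal sketch of what I mean, assuming a MySQL-compatible Aurora cluster, that PyMySQL is made available to the job (e.g. via --additional-python-modules), and that the host, credentials, and index name below are placeholders:

import pymysql

# Placeholder connection details; in practice these would come from the
# Glue connection properties or a secrets store, not be hard-coded.
conn = pymysql.connect(
    host="local-dev-aurora.cluster-xxxxxxxx.eu-west-1.rds.amazonaws.com",
    user="admin",
    password="********",
    database="local-dev-db",
)

try:
    with conn.cursor() as cursor:
        # Plain index on id; a composite index would list several columns.
        cursor.execute("CREATE INDEX idx_attributes_id ON attributes (id)")
    conn.commit()
finally:
    conn.close()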