apache-spark - Inserting into Redshift using spark-redshift

Question

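I am trying to load data from S3 (Parquet files) into Redshift. Through SQLWorkbench, inserting 6 million rows takes 46 seconds. Through the spark-redshift connector, the same load takes about 7 minutes.

I have tried using more nodes and get the same result.

Are there any suggestions for improving the load time with spark-redshift?

Code in Spark: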

import org.apache.spark.sql.SaveMode

val df = spark.read.option("basePath", "s3a://parquet/items").parquet("s3a://parquet/items/Year=2017/Month=7/Day=15")

df.write
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:....")
      .option("dbtable", "items")
      .option("tempdir", "s3a://parquet/temp")
      .option("aws_iam_role", "...")
      .option("sortkeyspec", "SORTKEY(id)")
      .mode(SaveMode.Append)
      .save()

Code in SQLWorkbench (Redshift SQL):

CREATE EXTERNAL TABLE items_schema.parquet_items("id type, column2 type....")
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS PARQUET
LOCATION 's3://parquet/items/Year=2017/Month=7/Day=15';

CREATE TABLE items ("id type, column2 type....");

INSERT INTO items (SELECT * FROM items_schema.parquet_items);

score 2 · Accepted Answer

Answered on 2018-02-08T16:39:55.297

score 0 · Accepted Answer

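Also, try using CSV instead of Avro (the default) as the temporary format. It should be faster. From the documentation:

Redshift is significantly faster when loading CSV than when loading Avro files, so using that tempformat may provide a large performance boost when writing to Redshift.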

https://docs.databricks.com/spark/latest/data-sources/aws/amazon-redshift.html
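
As a rough sketch of that suggestion, the write from the question could set the connector's `tempformat` option to `CSV`. The option values are copied from the question, with the elided JDBC URL and IAM role left as they were. The names `redshiftOptions` and `appendToRedshift` are hypothetical and used only for illustration. How much this helps will depend on the data and the cluster, so it is worth timing both formats.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Options from the question, plus tempformat as suggested in the answer.
// "jdbc:...." and "..." are the question's own placeholders, left unfilled.
val redshiftOptions: Map[String, String] = Map(
  "url"          -> "jdbc:....",
  "dbtable"      -> "items",
  "tempdir"      -> "s3a://parquet/temp",
  "aws_iam_role" -> "...",
  "sortkeyspec"  -> "SORTKEY(id)",
  "tempformat"   -> "CSV" // the connector default is AVRO
)

// Hypothetical helper: appends a DataFrame to the Redshift table,
// staging the data in tempdir as CSV instead of Avro.
def appendToRedshift(df: DataFrame): Unit =
  df.write
    .format("com.databricks.spark.redshift")
    .options(redshiftOptions)
    .mode(SaveMode.Append)
    .save()
```

With the `df` read in the question, this would be called as `appendToRedshift(df)`.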