postgresql - Reading data from Postgres in parallel with Spark for a table without an integer primary key column

Question

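I am struggling to read data from a Postgres table that contains 102 million records for one particular quarter. The table holds data for multiple quarters. At the moment I read the data through the Spark JDBC connector, and fetching it takes too long. When I run an action on the DataFrame, such as count(), loading the data takes almost 15-20 minutes. The data is loaded and processed in a single task, so I would like to read and process it in parallel.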

I am using the code below to create the connection and fetch the data:

import java.util.Properties

// Build the JDBC connection properties.
// `user`, `password` and `environment` are not shown in this snippet.
def getDbConnectionProperties(environment: Int): Properties = {
  val connectionProps = new Properties()
  connectionProps.setProperty("user", user)
  connectionProps.setProperty("password", password)
  connectionProps.setProperty("driver", "org.postgresql.Driver")
  // Needed to save records of UUID type, which are strings in the DataFrame schema
  connectionProps.setProperty("stringtype", "unspecified")
  connectionProps
}

val jdbcurl = "jdbc:postgresql://xxxxx:5432/test"
val connectionString = jdbcurl
val connectionProps = getDbConnectionProperties(environment)
// Partially applied reader: only the table name or subquery is still to be supplied
val readPGSqlData = spark.read.jdbc(connectionString, _: String, connectionProps)
// PostgreSQL string literals take single quotes; double quotes denote identifiers
val query = "(select Column_names from TableName where Period = '2020Q1') a"
val PGExistingRecords = readPGSqlData(query)
PGExistingRecords.count() // takes 15-20 minutes

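I know that the data can be read in parallel if you specify a partition column together with a lower and an upper bound, and that the partition column needs to be an integer, but in my case I do not have any integer column. The primary key is also of type GUID.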

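Any way to read the data faster, or to read it with parallel tasks, would help. I would appreciate suggestions on whether a third-party tool offers this, or whether it can be done with the native JDBC connector.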

score 0 · Accepted Answer

For a GUID type, you can split the data by its first character:

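val tableData =
  // Lowercase hex digits: PostgreSQL renders uuid values in lowercase
  List("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "a", "b", "c", "d", "e", "f")
    .map { prefix =>
      // One query per leading hex digit. LIKE is not defined for the uuid type,
      // so the column is cast to text; lower() also covers uppercase text GUIDs.
      val sql = "(select * from table_name where lower(guid_col::text) like '%s%s') t".format(prefix, "%")
      spark.read.jdbc(url = url, table = sql, properties = connectionProps)
    }
    // Combine the 16 partial DataFrames into one
    .reduce(_.union(_))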

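Performance note: for best performance you should have a clustered index on the GUID column, or a nonclustered index that includes all the other columns. All reading threads will then use an Index Seek and sequential I/O. Otherwise each thread may end up doing a full table scan or random I/O, which can be slower than reading the table in a single thread.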
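A possible variation on the same idea, sketched here under assumptions rather than taken from the answer: a `LIKE` on a text form of the column generally cannot use a plain index on a `uuid` column, whereas range comparisons on the column itself can. The helper below builds 16 non-overlapping range predicates, one per leading hex digit. The object name `UuidRangePredicates` and the column name `guid_col` are hypothetical, and the column is assumed to be of PostgreSQL type `uuid` and non-null (as a primary key would be).

```scala
// Hypothetical helper: builds one WHERE-clause fragment per leading hex digit.
// Assumes the column is a non-null PostgreSQL uuid column.
object UuidRangePredicates {
  private val HexDigits = "0123456789abcdef"

  // Smallest UUID starting with the given hex digit,
  // e.g. 'a' -> "a0000000-0000-0000-0000-000000000000".
  private def lowestUuid(prefix: Char): String =
    s"${prefix}0000000-0000-0000-0000-000000000000"

  def build(column: String): Array[String] =
    HexDigits.indices.map { i =>
      val lower = s"$column >= '${lowestUuid(HexDigits(i))}'::uuid"
      // The last range has no upper bound, so it also covers 'f...' values.
      if (i == HexDigits.length - 1) lower
      else s"$lower AND $column < '${lowestUuid(HexDigits(i + 1))}'::uuid"
    }.toArray
}

val predicates: Array[String] = UuidRangePredicates.build("guid_col")
predicates.foreach(println)
```

The resulting array can be passed to the `spark.read.jdbc(url, table, predicates, connectionProperties)` overload, which creates one partition per predicate, so the result is a single DataFrame with 16 partitions and no `union` is needed. Whether this is actually faster depends on the indexes available and on how evenly the GUIDs are distributed across leading digits.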
