pyspark - PySpark JDBC write to MySQL (TiDB)

Question

I am trying to write a PySpark DataFrame (millions of rows) to TiDB, using Spark 2.3:

df.write.format('jdbc').options(
  url='jdbc:mysql://<host>:<port>/<database>',
  driver='com.mysql.jdbc.Driver',
  dbtable='<tablename>',
  user='<username>',
  password='<password>',
  batchsize = 30000,
  truncate = True
).mode('overwrite').save()

However, all I ever get is this error:

Caused by: java.sql.BatchUpdateException: statement count 5001 exceeds the transaction limitation, autocommit = false
....
....
....
Caused by: java.sql.SQLException: statement count 5001 exceeds the transaction limitation, autocommit = false

Any idea how to fix this?

score 2 · Accepted Answer

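You should add ?rewriteBatchedStatements=true to your JDBC URI so that the DML statements are actually batched. With this setting the MySQL driver rewrites each batch of INSERTs into multi-row statements, so writes are faster and you are much less likely to hit the database's per-transaction statement limit.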
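
A minimal sketch of what that could look like, assuming the same driver and options as in the question. The helper names build_jdbc_url and write_to_tidb, the host tidb.example.com, and the database name mydb are hypothetical placeholders, not part of the original post:

```python
from urllib.parse import urlencode


def build_jdbc_url(host, port, database, **params):
    """Build a MySQL-protocol JDBC URL and append driver parameters."""
    base = f"jdbc:mysql://{host}:{port}/{database}"
    query = urlencode(params)
    return f"{base}?{query}" if query else base


def write_to_tidb(df, url, table, user, password, batchsize=30000):
    """Overwrite `table` with the rows of `df` through Spark's JDBC writer."""
    (df.write.format("jdbc")
        .option("url", url)
        .option("driver", "com.mysql.jdbc.Driver")
        .option("dbtable", table)
        .option("user", user)
        .option("password", password)
        .option("batchsize", batchsize)
        .option("truncate", "true")
        .mode("overwrite")
        .save())


# Placeholder host and database; only the URL parameter is the actual fix.
url = build_jdbc_url(
    "tidb.example.com", 4000, "mydb",
    rewriteBatchedStatements="true",
)
print(url)
```

With a real DataFrame this would be called as write_to_tidb(df, url, '<tablename>', '<username>', '<password>').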

score 0 · Answer

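You can try setting the "isolationLevel" option to NONE. Spark then writes without wrapping each partition in an explicit transaction, which avoids the transaction limit: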

df.write.format('jdbc').options(
  url='jdbc:mysql://<host>:<port>/<database>',
  driver='com.mysql.jdbc.Driver',
  dbtable='<tablename>',
  user='<username>',
  password='<password>',
  batchsize = 30000,
  truncate = True,
  isolationLevel = 'NONE'  # must be the string 'NONE', not Python's None
).mode('overwrite').save()
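
Spark's JDBC writer documents isolationLevel as one of the strings NONE, READ_COMMITTED, READ_UNCOMMITTED, REPEATABLE_READ, or SERIALIZABLE. As a rough sketch, the options can be assembled by a small helper that rejects anything else (for example Python's None) before the job starts. The helper name jdbc_write_options and the example host, database, and table names are hypothetical:

```python
# Isolation levels accepted by Spark's JDBC writer, per the Spark SQL docs.
VALID_ISOLATION_LEVELS = {
    "NONE",
    "READ_COMMITTED",
    "READ_UNCOMMITTED",
    "REPEATABLE_READ",
    "SERIALIZABLE",
}


def jdbc_write_options(url, table, user, password,
                       isolation_level="NONE", batchsize=30000):
    """Return an options dict for df.write.format('jdbc').options(**opts)."""
    if (not isinstance(isolation_level, str)
            or isolation_level.upper() not in VALID_ISOLATION_LEVELS):
        raise ValueError(
            f"isolationLevel must be one of {sorted(VALID_ISOLATION_LEVELS)}, "
            f"got {isolation_level!r}"
        )
    return {
        "url": url,
        "driver": "com.mysql.jdbc.Driver",
        "dbtable": table,
        "user": user,
        "password": password,
        "batchsize": str(batchsize),
        "truncate": "true",
        "isolationLevel": isolation_level.upper(),
    }


# Placeholder connection details, for illustration only.
opts = jdbc_write_options(
    "jdbc:mysql://tidb.example.com:4000/mydb",
    "my_table", "my_user", "my_password",
)
# With a real DataFrame:
# df.write.format("jdbc").options(**opts).mode("overwrite").save()
```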
