PySpark 3.0.1
I found in the documentation of the jdbc function at https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader that it says:
column – the name of a column of numeric, date, or timestamp type that will be used for partitioning.
I took that to mean it accepts a datetime column for partitioning the query.
So I tried this on EMR-6.2.0 (PySpark 3.0.1):
from datetime import datetime  # only needed for the commented-out trial below

sql_conn_params = get_spark_conn_params()  # my function; returns url, properties, etc.
sql_conn_params['column'] = 'EVENT_CAPTURED'   # timestamp column to partition on
sql_conn_params['numPartitions'] = 8
# sql_conn_params['lowerBound'] = datetime.strptime('2016-01-01', '%Y-%m-%d')  # another trial
# sql_conn_params['upperBound'] = datetime.strptime('2016-01-10', '%Y-%m-%d')
sql_conn_params['lowerBound'] = '2016-01-01 00:00:00'
sql_conn_params['upperBound'] = '2016-01-10 00:00:00'

df = (spark.read.jdbc(
    table=tablize(sql),  # my helper that wraps the query as a subquery
    **sql_conn_params
))
df.show()
I got this error:
invalid literal for int() with base 10: '2016-01-01 00:00:00'
Traceback (most recent call last):
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 625, in jdbc
return self._df(self._jreader.jdbc(url, table, column, int(lowerBound), int(upperBound),
ValueError: invalid literal for int() with base 10: '2016-01-01 00:00:00'
I looked at the source code here, https://github.com/apache/spark/blob/master/python/pyspark/sql/readwriter.py#L865, and found that it does not support the date/timestamp types the documentation describes: the Python wrapper casts lowerBound and upperBound to int before handing them to the JVM.
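For context, my reading of the source is that the int() cast only happens in this Python convenience method, and that the option-based reader would pass the bounds through to the JVM as strings. A rough sketch of what I am thinking of trying instead (jdbc_url, db_user, db_password, and tablize are placeholders from my own setup, not something taken from the docs):

df = (spark.read.format("jdbc")
      .option("url", jdbc_url)                      # placeholder connection URL
      .option("user", db_user)                      # placeholder credentials
      .option("password", db_password)
      .option("dbtable", tablize(sql))              # my helper, same as above
      .option("partitionColumn", "EVENT_CAPTURED")
      .option("lowerBound", "2016-01-01 00:00:00")  # bounds stay strings end to end
      .option("upperBound", "2016-01-10 00:00:00")
      .option("numPartitions", 8)
      .load())

If that works, it would suggest the limitation is in the Python shortcut rather than in the JDBC data source itself.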
My question is:
As the source code shows, PySpark's jdbc() does not support a datetime-typed partition column, so why does the documentation say it does?
Thanks,
Yan