
While randomizing datetimes to test a database, I saved them to parquet using pyarrow.parquet's write_table(), then read them back using read_table().

Upon trying to convert to Python datatypes with to_pydict(), I received the following error:

---> 81 from_parquet = pq.read_table('parquet_vs_csv').to_pydict()
     82
     83 '''

pyarrow/table.pxi in pyarrow.lib.Table.to_pydict (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:38283)()

pyarrow/table.pxi in pyarrow.lib.Column.to_pylist (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:31782)()

pyarrow/table.pxi in pyarrow.lib.ChunkedArray.to_pylist (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:30410)()

pyarrow/array.pxi in __iter__ (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:25015)()

pyarrow/scalar.pxi in pyarrow.lib.TimestampValue.as_py (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:21082)()

pyarrow/scalar.pxi in pyarrow.lib.lambda5 (/arrow/python/build/temp.linux-x86_64-2.7/lib.cxx:7234)()

pandas/_libs/tslib.pyx in pandas._libs.tslib.Timestamp.__new__ (pandas/_libs/tslib.c:10051)()

pandas/_libs/tslib.pyx in pandas._libs.tslib.convert_to_tsobject (pandas/_libs/tslib.c:27665)()

OverflowError: Python int too large to convert to C long

I played around, and this happens for datetimes with a year larger than roughly 2700 (this was at work, so I forget the exact cutoff, which was somewhat lower than that).

I'm new to pyarrow, is this expected behavior?


1 Answer


The underlying problem here is that pandas represents a datetime as an int64 count of nanoseconds since 1970. The limit you hit is simply the point at which the number of nanoseconds since 1970 no longer fits in an int64, which happens in the year 2262 (`pd.Timestamp.max`).

In Arrow, you can represent these dates by using a coarser granularity such as milliseconds-since-1970, but on conversion to pandas they are always cast to nanoseconds-since-1970, and thus such a date cannot be represented in pandas.

Answered 2017-12-28T07:41:34.390