I have a DataFrame written from Spark in Parquet format that has a column of type "vector". Printing the schema in Spark gives the following:

DataFrame[key: string, embedding: vector]

I tried the following two approaches in Python with pandas:

df = pandas.read_parquet("test.parquet", engine='auto') 

which gives the following error

  File "pyarrow/_dataset.pyx", line 399, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 1994, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 122, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Mix of struct and list types not yet supported

df = pandas.read_parquet("test.parquet", engine='fastparquet')

which reads it just fine but with a different schema (possibly converting the dense vector to a sparse representation)

>>> df.dtypes
key                   object
embedding.type          int8
embedding.size       float64
embedding.indices     object
embedding.values      object

The original DataFrame in Spark does not use a sparse representation:

>>> df.take(1)
[Row(key='1', embedding=DenseVector([-0.0414, 0.0767, 0.2612, 0.0443, 0.0744, -0.0956, -0.1825, -0.2083, -0.1557, -0.266, 0.2057, 0.138, 0.2796, 0.205, 0.0058, 0.1772, 0.1278, 0.0435, -0.1279, 0.2087, -0.374, -0.2452, -0.0093, 0.1221, 0.3578, -0.2423, -0.0303, -0.0099, -0.0991, -0.0875, -0.2843, -0.2205, 0.185, -0.111, 0.0886, -0.0833, -0.1093, 0.0568, -0.1098, 0.0313, 0.2832, -0.2354, -0.025, -0.2765, 0.1904, -0.1498, -0.1026, -0.0652, -0.3952, 0.2186, -0.1586, 0.1917, 0.2394, 0.1607, -0.5347, -0.5082, 0.2372, 0.1505, 0.1101, -0.439, -0.1054, 0.0092, -0.2694, -0.204, -0.2395, 0.1067, -0.0903, 0.1318, 0.3564, -0.1658, -0.3167, 0.3514, 0.0994, -0.0031, 0.0449, 0.2622, -0.2936, -0.5026, 0.5166, -0.2375, -0.1766, 0.2452, -0.2143, -0.3317, 0.249, -0.0464, 0.1103, 0.13, -0.0922, -0.1263, 0.443, 0.5802, -0.0656, 0.0532, 0.0589, 0.3378, 0.0513, -0.4131, 0.1765, 0.1331, -0.2858, -0.0871, 0.2262, 0.1019, -0.1508, -0.2853, 0.3168, 0.1593, -0.3701, 0.2883, 0.1121, -0.0968, -0.0344, 0.1632, -0.1378, -0.1383, -0.1744, -0.0442, 0.0378, 0.0212, -0.0548, -0.3263, -0.2908, -0.2052, 0.0434, 0.0069, 0.1091, 0.1618]))]
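
The four `embedding.*` columns that fastparquet returns look like the storage layout of Spark's ML vector type, which is a struct with `type`, `size`, `indices` and `values` fields. As far as I understand it, `type` 1 marks a dense vector whose entries are all in `values`, while `size` and `indices` are only filled in for sparse vectors. Below is a minimal sketch, assuming that layout, of rebuilding dense NumPy arrays from the flattened columns. The `flat` frame is a hypothetical stand-in for the fastparquet result and `to_dense` is a helper name made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the frame returned by
# pandas.read_parquet("test.parquet", engine="fastparquet").
flat = pd.DataFrame({
    "key": ["1", "2"],
    "embedding.type": np.array([1, 1], dtype="int8"),  # assumed: 1 = dense, 0 = sparse
    "embedding.size": [np.nan, np.nan],                 # assumed: only set for sparse vectors
    "embedding.indices": [None, None],                  # assumed: only set for sparse vectors
    "embedding.values": [[-0.0414, 0.0767, 0.2612],
                         [0.5, 0.0, -0.25]],
})


def to_dense(row):
    """Rebuild one dense float64 array from the flattened vector columns."""
    values = np.asarray(row["embedding.values"], dtype="float64")
    if row["embedding.type"] == 1:
        # Dense vector: the values are already the full vector.
        return values
    # Sparse vector: scatter the stored values into a zero vector.
    dense = np.zeros(int(row["embedding.size"]), dtype="float64")
    dense[np.asarray(row["embedding.indices"], dtype="int64")] = values
    return dense


# One array per row, then stacked into a (rows, dimensions) matrix.
embeddings = [to_dense(row) for _, row in flat.iterrows()]
matrix = np.vstack(embeddings)
```

If changing the Spark job is an option, converting the column with `pyspark.ml.functions.vector_to_array` (available since Spark 3.0) before writing should store a plain array of doubles instead of the struct, which may sidestep the pyarrow error, although I have not verified that against this file.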

Is there a parameter I can adjust to make pyarrow read this file? Or a way to read it with the correct schema in fastparquet?
