I read a parquet file from HDFS:
path <- "hdfs://part_2015"
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)
root
|-- app: binary (nullable = true)
|-- category: binary (nullable = true)
|-- date: binary (nullable = true)
|-- user: binary (nullable = true)
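Note that every column comes back as binary rather than string, even though the underlying data is plain text. For completeness, the context was set up in the standard way; a minimal sketch of the standalone equivalent, assuming SparkR 1.4's stock init functions (inside the bundled sparkR shell, sc and sqlContext are created automatically):

library(SparkR)

# Standard SparkR 1.4 initialization (the sparkR shell does this for you)
sc <- sparkR.init(master = "yarn-client")
sqlContext <- sparkRSQL.init(sc)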
class(AppDF)
[1] "DataFrame"
attr(,"package")
[1] "SparkR"
collect(AppDF)
.....error:
arguments imply differing number of rows: 46021, 39175, 62744, 27137
head(AppDF)
.....error:
arguments imply differing number of rows: 36, 30, 48
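The error message itself comes from base R's data.frame(), which refuses to assemble columns of different lengths, so collect() apparently ends up there with ragged columns. A tiny illustration of where the wording originates:

# Base R raises the same message when column vectors disagree in length
data.frame(a = 1:3, b = 1:2)
# Error in data.frame(a = 1:3, b = 1:2) :
#   arguments imply differing number of rows: 3, 2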
I have read some posts about this error, but they do not match my case: all I am doing is reading a table from a parquet file and then calling head() or collect() on it. The parquet table looks like this:
app    category  date      user
aaa    test      20150101  123
aaa    test      20150102  345
aaa    test      20150103  678
aaaa   testA     20150104  123
aaaa   testA     20150105  234
aaaa   testA     20150106  4345
bbbb   testB     20150101  5435
I am using spark-1.4.0-bin-hadoop2.6, launched on the cluster with:
./sparkR --master yarn-client
I also tried it locally and hit the same problem.
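The local run was the equivalent of the following; the path here stands for a hypothetical local copy of the same file, and the failure is identical:

./sparkR --master local[2]
# then, in the shell:
path <- "file:///tmp/part_2015"   # hypothetical local copy of the parquet data
AppDF <- parquetFile(sqlContext, path)
collect(AppDF)   # fails with the same "differing number of rows" error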
showDF(AppDF)
+-----------+-----------+-----------+-----------+
| app| category| date| user|
+-----------+-----------+-----------+-----------+
|[B@217fa749|[B@43bfbacd|[B@60810b7a|[B@3818a815|
|[B@5ac31778|[B@3e39f5d5|[B@4f3a92dd| [B@e8013ce|
|[B@7a9440d1|[B@1b2b9836|[B@4b160f29|[B@153d7342|
|[B@7559fcf2|[B@66edb00e|[B@7ec19bec|[B@58e3e3f7|
|[B@598b9ab8|[B@5c5ad3f5|[B@4f11a931|[B@107af885|
|[B@7951ec36|[B@716b0b73|[B@2abce531|[B@576b09e2|
|[B@34560144|[B@7a6d3233|[B@16faf110|[B@34e85d39|
| [B@3406452|[B@787a4528|[B@235282e3|[B@7e0f1732|
|[B@10bc1446|[B@2bd7083f|[B@325e7695|[B@57bb4a08|
|[B@48f98037|[B@7450c04e|[B@61817c8a|[B@7c177a08|
|[B@694ce2dd|[B@36c2512d| [B@f5f7d71|[B@46248d99|
|[B@479dee25|[B@517de3de|[B@1ffb2d9e|[B@236ff079|
|[B@52ac196f|[B@20b9f0d0| [B@f70f879|[B@41c8d7da|
|[B@68d34af3| [B@7ddcd49|[B@72d077a7|[B@545fafd4|
|[B@5610b292|[B@623bbb62|[B@3f8b5150|[B@53877bc7|
|[B@63cf70a8|[B@47ed58c9|[B@2f601903|[B@4e0a2c41|
|[B@7ddf876d|[B@5e3445aa|[B@39c9cc37|[B@6f7e4c84|
|[B@4cd1a74b|[B@583e5453|[B@64124267|[B@6ac5ab84|
|[B@577f9ddf|[B@7b55c859|[B@3cd48a51|[B@25c4eb0a|
|[B@2322f0e5|[B@4af55c68|[B@3285d64a|[B@70b7ae2f|
+-----------+-----------+-----------+-----------+
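Those [B@... values are Java's default toString() for byte[] arrays, i.e. the columns really are coming back as raw byte arrays rather than strings. Casting each column to string before collecting seems like it should sidestep the binary handling; a sketch, assuming SparkR 1.4's cast() and alias() column functions (I have not confirmed this avoids the error):

AppDF2 <- select(AppDF,
                 alias(cast(AppDF$app, "string"), "app"),
                 alias(cast(AppDF$category, "string"), "category"),
                 alias(cast(AppDF$date, "string"), "date"),
                 alias(cast(AppDF$user, "string"), "user"))
collect(AppDF2)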
I also tried reading this parquet file in Scala and calling collect() on it; there everything works fine, so this looks like a SparkR-specific problem.
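Since the file's string columns are evidently stored as plain binary (without the parquet UTF8 annotation), another thing worth trying is Spark SQL's spark.sql.parquet.binaryAsString option, which tells Spark to interpret binary parquet columns as strings; a sketch, assuming the setting can be applied through sql() before reading (untested from SparkR):

# Treat un-annotated binary parquet columns as strings
sql(sqlContext, "SET spark.sql.parquet.binaryAsString=true")
AppDF <- parquetFile(sqlContext, path)
printSchema(AppDF)   # the columns should now show up as string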