
I have a Spark Structured Streaming job that writes data to IBM Cloud Object Storage (S3):

dataDf.
  writeStream.
  format("parquet").
  trigger(Trigger.ProcessingTime(trigger_time_ms)).
  option("checkpointLocation", s"${s3Url}/checkpoint").
  option("path", s"${s3Url}/data").
  option("spark.sql.hive.convertMetastoreParquet", false).
  partitionBy("InvoiceYear", "InvoiceMonth", "InvoiceDay", "InvoiceHour").
  start()

I can see the data with the hdfs CLI:

[clsadmin@xxxxx ~]$ hdfs dfs -ls s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0 | head
Found 616 items
-rw-rw-rw-   1 clsadmin clsadmin      38085 2018-09-25 01:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      45874 2018-09-25 00:31 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-28ff873e-8a9c-4128-9188-c7b763c5b4ae.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin       5124 2018-09-25 01:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-5f768960-4b29-4bce-8f31-2ca9f0d42cb5.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      40154 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-70abc027-1f88-4259-a223-21c4153e2a85.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      41282 2018-09-25 00:50 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-873a1caa-3ecc-424a-8b7c-0b2dc1885de4.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      41241 2018-09-25 00:40 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-88b617bf-e35c-4f24-acec-274497b1fd31.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin       3114 2018-09-25 00:01 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-deae2a19-1719-4dfa-afb6-33b57f2d73bb.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      38877 2018-09-25 00:10 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-e07429a2-43dc-4e5b-8fe7-c55ec68783b3.c000.snappy.parquet
-rw-rw-rw-   1 clsadmin clsadmin      39060 2018-09-25 00:20 s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00001-1553da20-14d0-4c06-ae87-45d22914edba.c000.snappy.parquet

However, when I try to query the data:

hive> select * from invoiceitems limit 5;
OK
Time taken: 2.392 seconds

My table DDL looks like this:

CREATE EXTERNAL TABLE `invoiceitems`(
  `invoiceno` int,
  `stockcode` int,
  `description` string,
  `quantity` int,
  `invoicedate` bigint,
  `unitprice` double,
  `customerid` int,
  `country` string,
  `lineno` int,
  `invoicetime` string,
  `storeid` int,
  `transactionid` string,
  `invoicedatestring` string)
PARTITIONED BY (
  `invoiceyear` int,
  `invoicemonth` int,
  `invoiceday` int,
  `invoicehour` int)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://streaming-data-landing-zone-partitioned/data'

I have also tried using the correct case for the column/partition names - that didn't work either.

Any ideas why my query isn't finding the data?


Update 1:

I tried setting the location to a directory containing the data without partitions, and that still doesn't work, so I'm wondering whether it is a data format issue?

CREATE EXTERNAL TABLE `invoiceitems`(
  `InvoiceNo` int,
  `StockCode` int,
  `Description` string,
  `Quantity` int,
  `InvoiceDate` bigint,
  `UnitPrice` double,
  `CustomerID` int,
  `Country` string,
  `LineNo` int,
  `InvoiceTime` string,
  `StoreID` int,
  `TransactionID` string,
  `InvoiceDateString` string)
PARTITIONED BY (
  `InvoiceYear` int,
  `InvoiceMonth` int,
  `InvoiceDay` int,
  `InvoiceHour` int)
STORED AS PARQUET
LOCATION
  's3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/';

hive> Select * from invoiceitems limit 5;
OK
Time taken: 2.066 seconds

1 Answer


Reading from Snappy-compressed Parquet files

The data is in Snappy-compressed Parquet format:

s3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0/part-00000-1e1dda99-bec2-447c-9bd7-bedb1944f4a9.c000.snappy.parquet

So set the 'parquet.compression'='SNAPPY' table property in the CREATE TABLE DDL statement. You can also set parquet.compression=SNAPPY for IOP or HDP in the "Custom hive-site" settings section in Ambari.

Here is an example of using the table property in a Hive CREATE TABLE statement:

hive> CREATE TABLE inv_hive_parquet(
        trans_id int, product varchar(50), trans_dt date)
      PARTITIONED BY (year int)
      STORED AS PARQUET
      TBLPROPERTIES ('parquet.compression'='SNAPPY');

Updating partition metadata for external tables

Also, for an external partitioned table, the partition metadata needs to be updated whenever an external job (the Spark job in this case) writes partitions directly into the data folder, because Hive will not be aware of those partitions unless they are explicitly registered.

This can be done in either of these ways:

ALTER TABLE inv_hive_parquet RECOVER PARTITIONS;
-- or
MSCK REPAIR TABLE inv_hive_parquet;
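
If only specific partitions are missing, they can also be registered one at a time instead of scanning the whole table location. A minimal sketch against the question's table, using the partition values visible in the example path (adjust the values and location to your own partitions):

```sql
-- Register a single partition explicitly; IF NOT EXISTS makes it safe to re-run
ALTER TABLE invoiceitems ADD IF NOT EXISTS
  PARTITION (InvoiceYear=2018, InvoiceMonth=9, InvoiceDay=25, InvoiceHour=0)
  LOCATION 's3a://streaming-data-landing-zone-partitioned/data/InvoiceYear=2018/InvoiceMonth=9/InvoiceDay=25/InvoiceHour=0';
```

Since the streaming job keeps creating new hourly partition directories, running MSCK REPAIR TABLE periodically (or its Spark-side equivalent, spark.sql("MSCK REPAIR TABLE invoiceitems"), assuming a Hive-enabled SparkSession) is usually more practical than adding partitions by hand.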
answered 2018-09-25 at 16:07