pyspark - How do I load specific rows and columns from an Excel sheet into a HIVE table with pyspark?

Question

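I have an Excel file with 4 worksheets. In each worksheet the first 3 rows are blank, i.e. the data starts at row 4 and continues for thousands of rows. Note: per the requirements, I must not delete the blank rows.

My goals are as follows: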

1) Read the Excel file in Spark 2.1.
2) Ignore the first 3 rows and read the data from row 4 to row 50. The file has more than 2000 rows.
3) Convert each worksheet of the Excel file to a separate CSV, and load them into existing HIVE tables.

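Note: I have the flexibility to write separate code for each worksheet.

How can I achieve this?

I can create a DataFrame that reads a single file and loads it into HIVE, but I think my requirements go beyond that.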

score 1 · Accepted Answer

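For example, you can use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki).

With it, you have the following options:

1) Read the Excel file directly with Hive and use CTAS to convert it into a table in CSV format. You need to deploy the HadoopOffice Excel Serde (https://github.com/ZuInnoTe/hadoopoffice/wiki/Hive-Serde). Then you need to create the table (see the documentation for all options; this example reads from Sheet1 and skips the first 3 rows):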

create external table ExcelTable(<INSERTHEREYOURCOLUMNSPECIFICATION>)
ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde'
STORED AS INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat'
OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat'
LOCATION '/user/office/files'
TBLPROPERTIES(
  "hadoopoffice.read.simple.decimalFormat"="US",
  "hadoopoffice.read.sheet.skiplines.num"="3",
  "hadoopoffice.read.sheet.skiplines.allsheets"="true",
  "hadoopoffice.read.sheets"="Sheet1",
  "hadoopoffice.read.locale.bcp47"="US",
  "hadoopoffice.write.locale.bcp47"="US"
);

Then do a CTAS into a table in CSV format:

create table CSVTable ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' AS Select * from ExcelTable;
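The example above covers only Sheet1, while the question mentions four worksheets. One possible way to repeat the same two statements per worksheet is to generate them from a small script. The following is a minimal sketch, not part of the original answer: the sheet names and the `ExcelTable_<sheet>` / `CSVTable_<sheet>` table names are hypothetical, and the column specification is left as the same placeholder. All external tables point at the same location and differ only in the `hadoopoffice.read.sheets` property.

```python
# Sketch: build one external-table DDL and one CTAS statement per worksheet.
# Sheet names and table names are hypothetical placeholders; the SerDe classes
# and table properties are copied from the statements shown above.

SERDE = "org.zuinnote.hadoop.excel.hive.serde.ExcelSerde"
INPUT_FORMAT = "org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat"
OUTPUT_FORMAT = (
    "org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat"
)


def excel_table_ddl(table, sheet, columns,
                    location="/user/office/files", skip_lines=3):
    """Return the external-table DDL for one worksheet."""
    props = {
        "hadoopoffice.read.simple.decimalFormat": "US",
        "hadoopoffice.read.sheet.skiplines.num": str(skip_lines),
        "hadoopoffice.read.sheet.skiplines.allsheets": "true",
        "hadoopoffice.read.sheets": sheet,
        "hadoopoffice.read.locale.bcp47": "US",
        "hadoopoffice.write.locale.bcp47": "US",
    }
    prop_sql = ", ".join(f'"{key}"="{value}"' for key, value in props.items())
    return (
        f"create external table {table}({columns}) "
        f"ROW FORMAT SERDE '{SERDE}' "
        f"STORED AS INPUTFORMAT '{INPUT_FORMAT}' "
        f"OUTPUTFORMAT '{OUTPUT_FORMAT}' "
        f"LOCATION '{location}' "
        f"TBLPROPERTIES({prop_sql});"
    )


def csv_ctas(csv_table, excel_table):
    """Return the CTAS statement that copies an Excel table into CSV format."""
    return (
        f"create table {csv_table} ROW FORMAT DELIMITED "
        f"FIELDS TERMINATED BY ',' AS Select * from {excel_table};"
    )


def statements_for_sheets(sheets, columns):
    """Return DDL + CTAS for every sheet, in execution order."""
    statements = []
    for sheet in sheets:
        excel_table = f"ExcelTable_{sheet}"  # hypothetical naming scheme
        statements.append(excel_table_ddl(excel_table, sheet, columns))
        statements.append(csv_ctas(f"CSVTable_{sheet}", excel_table))
    return statements


if __name__ == "__main__":
    sheets = ["Sheet1", "Sheet2", "Sheet3", "Sheet4"]  # assumed sheet names
    for stmt in statements_for_sheets(
            sheets, "<INSERTHEREYOURCOLUMNSPECIFICATION>"):
        print(stmt)
```

The printed statements could then be run in Hive. Sheet names containing spaces or other special characters would need a different table-naming scheme, since they are reused in the table names here.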

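2) Use Spark. Depending on the Spark version, you have different options: for Spark 1.x you can use the HadoopOffice file format, and for Spark 2.x you can use the Spark2 DataSource (the latter also includes support for Python). See the HadoopOffice wiki for how to do this.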
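For the Spark 2.x route, a PySpark sketch might look roughly like the following. It is not taken from the original answer and rests on several assumptions: the data source name `org.zuinnote.spark.office.excel` and the option names (modelled on the Hive table properties above, without the `hadoopoffice.` prefix) should be checked against the HadoopOffice Spark2 DataSource documentation, the HadoopOffice package is assumed to be on the Spark classpath, and `spark` is assumed to be a `SparkSession` created with Hive support enabled. Rows 4 to 50 are approximated by skipping 3 lines and keeping the next 47 rows with `limit`, which relies on the rows arriving in file order.

```python
# Sketch only: the format name and option keys are assumptions; verify them
# against the HadoopOffice Spark2 DataSource documentation before use.

EXCEL_FORMAT = "org.zuinnote.spark.office.excel"  # assumed data source name


def excel_read_options(sheet, skip_lines=3):
    """Reader options for one worksheet (assumed option names)."""
    return {
        "read.locale.bcp47": "US",
        "read.spark.simpleMode": "true",  # assumed: plain typed columns
        "read.sheets": sheet,
        "read.sheet.skiplines.num": str(skip_lines),
    }


def rows_to_keep(first_row=4, last_row=50):
    """Number of data rows between first_row and last_row, inclusive."""
    return last_row - first_row + 1


def load_sheet_into_hive(spark, path, sheet, hive_table):
    """Read one worksheet and append its first rows to an existing Hive table.

    `spark` is expected to be a SparkSession built with enableHiveSupport();
    `hive_table` is the name of a table that already exists.
    """
    reader = spark.read.format(EXCEL_FORMAT)
    for key, value in excel_read_options(sheet).items():
        reader = reader.option(key, value)
    # Skip-lines handles the 3 blank rows; limit keeps rows 4..50 (47 rows),
    # assuming the rows are returned in file order.
    df = reader.load(path).limit(rows_to_keep())
    # insertInto matches columns by position, not by name.
    df.write.insertInto(hive_table)
```

A call such as `load_sheet_into_hive(spark, "/user/office/files", "Sheet1", "target_table")` could then be repeated once per worksheet, where `target_table` stands in for one of the existing Hive tables.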
