hadoop - Loading data from a comma-separated list of absolute file paths

Question

Consider the following Hive command:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Can I pass a comma-separated list of absolute file paths, like this?
LOAD DATA INPATH 'hdfs://foo/bar1,hdfs://foo/bar2' INTO TABLE foo1

The actual use case I have in mind:

When using the following

<datasets>
      <dataset name="input1">
         <uri-template>hdfs://foo/bar/a</uri-template>
      </dataset>
</datasets>
<input-events>
      <data-in name="coordInput1" dataset="input1">
          <start-instance>${coord:current(-23)}</start-instance>
          <end-instance>${coord:current(0)}</end-instance>
      </data-in>
</input-events>
<action>
  <workflow>
         ...
     <configuration>
       <property>
          <name>input_files</name>
          <value>${coord:dataIn('coordInput1')}</value>
       </property>
     </configuration>
  </workflow>
 </action>

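in co-ordinator.xml, suppose there is a resolved set of 24 HDFS locations as my input. If my Hive query should load data from all of these locations into a table, I would like to use it like this: `CREATE TABLE table1( col1 STRING )LOCATION (${input_files});`

However, this does not work in Hive. Suppose input_files resolves to `hdfs://foo/bar/1,hdfs://foo/bar/2,hdfs://foo/bar/3`; that is not a valid location in Hive.

As far as I understand, the only way to achieve this is to run a Java mapper that takes input_files as input and outputs a dynamic Hive script that runs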

`LOAD DATA INPATH 'hdfs://foo/bar/1' INTO TABLE foo1`
`LOAD DATA INPATH 'hdfs://foo/bar/2' INTO TABLE foo1`

separately, one statement per location.

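So, finally, my question is: since I can address the whole dataset I am interested in as `${coord:dataIn('coordInput1')}`, can't I use that directly in Hive and avoid separate individual `LOAD DATA..` or `ALTER TABLE ADD PARTITIONS..` statements?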

score 0 · Accepted Answer

Use a Java action node to perform this logic. You can split input_files on commas and, over a Hive JDBC connection, execute the Hive command in a loop for each input location.
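A minimal sketch of that approach, assuming the workflow passes the JDBC URL, the resolved input_files value, and the target table name as the three program arguments. The class name `HiveLoadPaths` and its method names are hypothetical, and the Hive JDBC driver is assumed to be on the action's classpath. It builds one `LOAD DATA INPATH` statement per comma-separated path and runs them in a loop over a single connection:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for an Oozie java action; names are illustrative only.
final class HiveLoadPaths {

    // Turn "p1,p2,p3" into one LOAD DATA statement per non-empty path.
    static List<String> buildLoadStatements(String inputFiles, String table) {
        // Conservative check, since the table name is concatenated into the statement.
        if (!table.matches("[A-Za-z0-9_.]+")) {
            throw new IllegalArgumentException("Unexpected table name: " + table);
        }
        List<String> statements = new ArrayList<>();
        for (String raw : inputFiles.split(",")) {
            String path = raw.trim();
            if (path.isEmpty()) {
                continue; // skip blanks such as "a,,b"
            }
            if (path.contains("'")) {
                // A quote would terminate the string literal in the statement.
                throw new IllegalArgumentException("Path contains a quote: " + path);
            }
            statements.add("LOAD DATA INPATH '" + path + "' INTO TABLE " + table);
        }
        return statements;
    }

    // Execute every generated statement over an existing JDBC connection.
    static void loadAll(Connection conn, String inputFiles, String table) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            for (String sql : buildLoadStatements(inputFiles, table)) {
                stmt.execute(sql);
            }
        }
    }

    // Assumed argument order: <jdbc-url> <input_files> <table>
    public static void main(String[] args) throws SQLException {
        if (args.length != 3) {
            System.err.println("usage: HiveLoadPaths <jdbc-url> <input_files> <table>");
            return;
        }
        try (Connection conn = DriverManager.getConnection(args[0])) {
            loadAll(conn, args[1], args[2]);
        }
    }
}
```

The JDBC URL would typically look like `jdbc:hive2://<host>:<port>/<database>` for HiveServer2; the host, port, and database are placeholders to fill in for your cluster. Keep in mind that `LOAD DATA INPATH` without `LOCAL` moves the source files into the table's location rather than copying them, so the input directories will no longer hold the data afterwards.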
