hadoop - Hive INSERT OVERWRITE DIRECTORY command output is not separated by a delimiter. Why?

Question

The file that I am loading is separated by ' ' (a space). Below is the file. The file resides in HDFS:-

1> I am creating an external table and loading the file by issuing the following command:-

CREATE EXTERNAL TABLE IF NOT EXISTS graph_edges (src_node_id STRING COMMENT 'Node ID of Source node', dest_node_id STRING COMMENT 'Node ID of Destination node') ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' STORED AS TEXTFILE LOCATION '/user/hadoop/input';

2> After this, I am simply inserting the table's contents into another file by issuing the following command:-

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' SELECT * FROM graph_edges;

3> Now, when I cat the file, the fields are not separated by any delimiter:-

hadoop dfs -cat /user/hadoop/output/000000_0

Output:-

Can anyone please help me out? Why is the delimiter being removed, and how do I delimit the output file?

In the CREATE TABLE command I also tried DELIMITED BY '\t', but then I got an unnecessary NULL column.
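One plausible reading of the NULL column: if the table is declared with '\t' but the rows in the file are separated by a space, a line contains no tab at all, so the whole line lands in the first column and the second column has nothing to read. The Python sketch below is a simplified model of delimited-text splitting, not Hive's actual SerDe; `split_row` is a hypothetical helper and the sample row is made up for illustration.

```python
def split_row(line, delimiter, num_columns):
    """Split one text row into a fixed number of columns.

    Missing trailing columns become None, mimicking how a delimited
    text table reports NULL for fields it cannot find.
    """
    fields = line.rstrip("\n").split(delimiter)
    # Pad with None when the row has fewer fields than the schema.
    fields += [None] * (num_columns - len(fields))
    # Drop any extra fields beyond the declared schema.
    return fields[:num_columns]


row = "001 000\n"  # hypothetical space-separated input row

print(split_row(row, " ", 2))   # both columns populated
print(split_row(row, "\t", 2))  # whole line in column 1, column 2 is None
```

If this model matches what happened, declaring the table with the delimiter the file really uses (a space here) should make the extra NULL disappear.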

Any pointers are much appreciated. I am using Hive version 0.9.0.

score 17 · Accepted Answer

The problem is that Hive does not allow you to specify the output delimiter - https://issues.apache.org/jira/browse/HIVE-634

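The solution is to create an external table for the output (with a delimiter specification) and insert into that table instead of into a directory.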

--

Assuming that you have /user/hadoop/input/graph_edges.csv in HDFS,

hive> create external table graph_edges (src string, dest string) 
    > row format delimited 
    > fields terminated by ' ' 
    > lines terminated by '\n' 
    > stored as textfile location '/user/hadoop/input';

hive> select * from graph_edges;
OK
001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

hive> create external table graph_out (src string, dest string) 
    > row format delimited 
    > fields terminated by ' ' 
    > lines terminated by '\n' 
    > stored as textfile location '/user/hadoop/output';

hive> insert into table graph_out select * from graph_edges;
hive> select * from graph_out;
OK
001 000
001 000
002 001
003 002
004 003
005 004
006 005
007 006
008 007
099 007

[user@box] hadoop fs -get /user/hadoop/output/000000_0 .

The downloaded file comes back as shown above, with spaces between the fields.

score 15 · Accepted Answer

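Although the question is more than 2 years old and the top answer was correct at the time, it is now possible to tell Hive to write delimited data to a directory.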

Here is an example of outputting data with the traditional ^A separator:

INSERT OVERWRITE DIRECTORY '/output/data_delimited'
SELECT *
FROM data_schema.data_table

And now with tab separators:

INSERT OVERWRITE DIRECTORY '/output/data_delimited'
row format delimited 
FIELDS TERMINATED BY '\t'
SELECT *
FROM data_schema.data_table
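Once the directory holds tab-separated text, a downstream script can read it with ordinary tools. As a minimal sketch, assuming the output is plain UTF-8 text with one row per line, Python's standard `csv` module can parse it; `read_tab_delimited` is a hypothetical helper and the sample contents are invented for illustration.

```python
import csv
import io


def read_tab_delimited(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    # QUOTE_NONE keeps quote characters as literal data instead of
    # treating them as CSV quoting.
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return list(reader)


sample_output = "001\t000\n002\t001\n"  # hypothetical file contents
print(read_tab_delimited(sample_output))  # [['001', '000'], ['002', '001']]
```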

score 11 · Accepted Answer

I think you can get the output you want by using the concat_ws function:

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' SELECT concat_ws(',', col1, col2) FROM graph_edges;

Here I have chosen a comma as the column separator.
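The reason this sidesteps the delimiter question is that the query now returns a single string column per row, so there is nothing left for Hive to separate. A rough Python model of the resulting lines, ignoring NULL handling; `join_columns` is a hypothetical stand-in and not Hive's implementation, and the rows are invented:

```python
def join_columns(separator, *columns):
    """Join string columns into one output field, as the query above does."""
    return separator.join(columns)


rows = [("001", "000"), ("002", "001")]  # hypothetical table contents
for src, dest in rows:
    print(join_columns(",", src, dest))
# prints:
# 001,000
# 002,001
```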

score 4 · Accepted Answer

I have a slightly different take.

Indeed, Hive does not support a custom delimiter here.

But when you use INSERT OVERWRITE DIRECTORY, there are delimiters in your lines. The delimiter is '\1'.

You can use hadoop dfs -cat $file | head -1 | xxd to find it, or get the file from HDFS to your local machine and open it with vim. In vim you will see characters like '^A', which is the delimiter.
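The same check can be sketched in Python once a line of the part file is available as bytes. The sample row below is hypothetical, written the way the default Ctrl-A separator would appear, and `show_hidden_delimiters` is an illustrative helper rather than anything Hive provides.

```python
def show_hidden_delimiters(raw_line: bytes) -> str:
    """Return a printable form of a row with the Ctrl-A byte made visible."""
    # Replace the Ctrl-A byte (0x01) with the caret notation vim uses.
    return raw_line.decode("utf-8").rstrip("\n").replace("\x01", "^A")


# Hypothetical row as written with the default field separator.
sample = b"001\x01000\n"

print(sample)                          # the bytes repr shows \x01 explicitly
print(show_hidden_delimiters(sample))  # 001^A000
print(sample.decode("utf-8").rstrip("\n").split("\x01"))  # ['001', '000']
```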

Back to the question: there is a simple way to solve it.

Still use INSERT OVERWRITE DIRECTORY '/user/hadoop/output' to generate /user/hadoop/output;

Then create an external table whose fields are delimited by '\1':

create external table graph_out (src string, dest string) 
row format delimited 
fields terminated by '\1' 
lines terminated by '\n' 
stored as textfile location '/user/hadoop/output';

score 3 · Accepted Answer

You can provide the delimiter when writing to a directory:

INSERT OVERWRITE DIRECTORY '/user/hadoop/output'
ROW FORMAT DELIMITED
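FIELDS TERMINATED BY ' '  -- the delimiter is missing in the source; a space is assumed here to match the question's input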
SELECT * FROM graph_edges;

This should work for you.

score 1 · Accepted Answer

I ran into this issue when the output of a Hive query needed to be pipe-delimited. Running this sed command replaces every ^A with |:

sed 's#\x01#|#g' test.log > piped_test.log
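For anyone who prefers to do the same substitution in a script, here is a minimal Python sketch of the idea. The function names are hypothetical, and the file names in the usage note simply mirror the sed command above.

```python
def replace_ctrl_a(text: str, new_delimiter: str = "|") -> str:
    """Swap every Ctrl-A (0x01) field separator for new_delimiter."""
    return text.replace("\x01", new_delimiter)


def convert_file(src_path: str, dst_path: str, new_delimiter: str = "|") -> None:
    """Stream src_path to dst_path line by line, replacing the separator."""
    with open(src_path, encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(replace_ctrl_a(line, new_delimiter))


print(replace_ctrl_a("001\x01000"))  # 001|000
```

Calling `convert_file("test.log", "piped_test.log")` would correspond to the sed invocation. As with sed, the result is ambiguous if a field value already contains the new delimiter.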

score 0 · Accepted Answer

The default separator is "^A". In Python, it is "\x01".

When I want to change the delimiter, I use SQL like this:

SELECT col1, delimiter, col2, delimiter, col3, ... FROM table

Then, treat delimiter+"^A" as the new delimiter.
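To see what this trick puts on disk, here is a sketch under the assumption that Hive writes the default Ctrl-A between every selected expression, including the literal delimiter columns. Under that assumption, two real columns end up separated by Ctrl-A, the delimiter, and another Ctrl-A, so a consumer may need to split on that full sequence to get clean values. `render_row` and `parse_row` are hypothetical helpers, and the sample values are invented.

```python
CTRL_A = "\x01"


def render_row(values, delimiter):
    """Model a row from SELECT col1, delimiter, col2, delimiter, col3."""
    selected = []
    for i, value in enumerate(values):
        if i > 0:
            selected.append(delimiter)  # the literal delimiter column
        selected.append(value)
    # Assumed default behavior: Ctrl-A between every selected expression.
    return CTRL_A.join(selected)


def parse_row(line, delimiter):
    """Recover the real columns by splitting on Ctrl-A + delimiter + Ctrl-A."""
    return line.split(CTRL_A + delimiter + CTRL_A)


line = render_row(["a", "b", "c"], ",")
print(repr(line))            # 'a\x01,\x01b\x01,\x01c'
print(parse_row(line, ","))  # ['a', 'b', 'c']
```

Printed to a terminal, such a line usually looks like plain `a,b,c` because Ctrl-A is not displayed, which is why the approach appears to work even though the extra bytes are still in the file.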

score 0 · Accepted Answer

I suspect that Hive is actually writing a Control-A as the delimiter, but when you cat the file to the screen it does not show up.

Instead, try bringing the file up in vi, or, if you only want to see a little of it, head the file and then open the result in vi:

hadoop dfs -cat /user/hadoop/output/000000_0 | head > my_local_file.txt

vi my_local_file.txt

You should be able to see the ^A characters in there.

score 0 · Accepted Answer

I guess this would be a better solution, although it is a roundabout way of achieving it.

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' SELECT src_node_id,' ',dest_node_id FROM graph_edges;

score 0 · Accepted Answer

You can use the clause "row format delimited fields terminated by '|'"; for example, in your case it should be

INSERT OVERWRITE DIRECTORY '/user/hadoop/output' row format delimited fields terminated by '|' select * from graph_edges;
