hadoop - PIG loading CSV - map type error

Question

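Our goal is to use PIG for large-scale analysis of our server logs. I need to load a PIG map data type from a file.

I tried running a sample PIG script with the following data.

A line in my CSV file named 'test' (the file PIG processes) looks like this: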

151364,[ref#R813,highway#secondary]

My PIG script:

a = LOAD 'test' using PigStorage(',') AS  (id:INT, m:MAP[]);
DUMP a;

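The idea is to load an int as the first element and a hashmap as the second. However, when I dump, the int field is parsed correctly (and printed in the dump), but the map field is not parsed, resulting in a parse error.

Can someone explain what I am missing?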

score 1 · Accepted Answer

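I think the problem is related to the delimiters: the field delimiter somehow affects parsing of the map field, or is confused with the delimiter used inside the map.

With this input data (note that I use a semicolon as the field delimiter):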

151364;[ref#R813,highway#secondary]

this is the output from my grunt shell:

grunt> a = LOAD '/tmp/temp2.txt' using PigStorage(';') AS (id:int, m:[]);
grunt> dump a;
...
(151364,[highway#secondary,ref#R813])

grunt> b = foreach a generate m#'ref'; 
grunt> dump b;
(R813)

score 1 · Accepted Answer

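At last, I found the problem. Just change the field delimiter from ',' to another character, such as a pipe. The field delimiter was being confused with the ',' delimiter used inside the map :)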

The string 151364,[ref#R813,highway#secondary] was being split into three fields:
field1: 151364  field2: [ref#R813  field3: highway#secondary]
Since '[ref#R813' is not a valid map value, there is a parse error.
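
To make that split concrete, here is a minimal Java sketch of delimiter-only splitting. It is not Pig's actual code: the class `DelimiterSplitDemo` and the method `splitFields` are hypothetical names made up for illustration, and the sketch only assumes that a line is cut at every occurrence of the field delimiter, with no awareness of the map brackets.

```java
import java.util.ArrayList;
import java.util.List;

class DelimiterSplitDemo {

    // Hypothetical helper (not part of Pig): cuts the line at every
    // occurrence of fieldDel and knows nothing about [...] map syntax.
    static List<String> splitFields(String line, char fieldDel) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == fieldDel) {
                // field runs from the previous delimiter up to this one
                fields.add(line.substring(start, i));
                start = i + 1;
            }
        }
        // whatever follows the last delimiter is the final field
        fields.add(line.substring(start));
        return fields;
    }

    // Prints one field per line so the brackets in the data stay readable.
    static void printFields(String line, char fieldDel) {
        List<String> fields = splitFields(line, fieldDel);
        System.out.println("delimiter '" + fieldDel + "': " + fields.size() + " fields");
        for (int i = 0; i < fields.size(); i++) {
            System.out.println("  field" + (i + 1) + ": " + fields.get(i));
        }
    }

    public static void main(String[] args) {
        // Comma as field delimiter: the map text is cut in half.
        printFields("151364,[ref#R813,highway#secondary]", ',');
        // Pipe as field delimiter: the map text stays in one piece.
        printFields("151364|[ref#R813,highway#secondary]", '|');
    }
}
```

With a comma, the sample line comes out as three fields and the second one is the incomplete `[ref#R813`; with a pipe it comes out as two fields and the map text `[ref#R813,highway#secondary]` is left intact for the map parser.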

I also looked at the source code of the PigStorage function and confirmed the parsing logic. From the source code:

@Override
public Tuple getNext() throws IOException {
    for (int i = 0; i < len; i++) {
        // skipping some stuff
        if (buf[i] == fieldDel) { // if we find the delimiter
            readField(buf, start, i); // extract field from previous delimiter to current
            start = i + 1;
            fieldID++;
        }
    }
}

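So, because PIG splits fields on the field delimiter before anything else, a field delimiter that also appears inside the map breaks the map field apart.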
