hadoop - Changing Schemas with Hadoop Cascading

Question

I'm trying to figure out how to use Cascading against an archive of data whose schema is additive over time. What I mean by additive is that it will start out with 3 columns, for example, and in the next release it might have 5 columns. These columns follow standard CSV layouts. My understanding is that if I specify a schema that is 5 columns long and the old data has only 3, then Cascading will fail.

Is there a way to tell Cascading to fill in the missing columns, for example with a default of null?

score 1 · Accepted Answer

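It turns out that for delimited text there is a special constructor on the scheme. The Cascading JavaDoc for that constructor says we can adjust the strictness of the parsing. If you set strict to false, Cascading will load the data and append nulls to the end of rows that have too few values. The confusion about this seems understandable, since there are two threads in the Cascading user group on how to do this.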
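To make the effect of non-strict parsing concrete, here is a minimal sketch in plain Java of the padding behavior described above. It is not Cascading code: `LenientRowParser` and its `parse` method are hypothetical names made up for illustration, the split is naive (no quote handling), and rejecting rows with too many values is a choice made for this sketch rather than a statement about what Cascading does.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper (not part of Cascading): illustrates padding a short
// delimited row with nulls so it matches the number of declared columns.
final class LenientRowParser {
    private LenientRowParser() {}

    static String[] parse(String line, String delimiter, int expectedColumns) {
        // Quote the delimiter so it is treated literally; limit -1 keeps
        // trailing empty values instead of silently dropping them.
        String[] values = line.split(Pattern.quote(delimiter), -1);
        if (values.length > expectedColumns) {
            // Assumption of this sketch: extra values are treated as an error.
            throw new IllegalArgumentException(
                "expected at most " + expectedColumns + " values, got " + values.length);
        }
        // Arrays.copyOf fills the added slots with null.
        return Arrays.copyOf(values, expectedColumns);
    }
}
```

With this helper, parsing an old 3-column row such as `a,b,c` against a 5-column schema yields the three values followed by two nulls. In Cascading itself you would not write this by hand; you would pass strict = false to the delimited-text scheme constructor the answer refers to, after checking the JavaDoc for the exact parameter list in your version.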

score 0 · Accepted Answer

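Instead of hard-coding your schema, you can make it configuration driven.

By that I mean you can define your list of columns in a properties file or an XML file.

This way you don't need to change the code very often.

Example:

columns: column1,column2,column3

You can pass that string array directly to the Fields constructor.

In fact, I have implemented this successfully in my current project.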
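The configuration-driven idea above could be sketched roughly as follows. This is an illustration only, not the answerer's actual code: the `ColumnConfig` class, its `columnNames` method, and the `columns` property key are assumptions chosen to match the example line. It uses only `java.util.Properties`, whose file format accepts either `:` or `=` between key and value.

```java
import java.util.Properties;

// Hypothetical helper: reads a comma-separated column list from a
// Properties object so the schema lives in configuration, not in code.
final class ColumnConfig {
    private ColumnConfig() {}

    static String[] columnNames(Properties props) {
        // "columns" is an assumed key name, taken from the example above.
        String raw = props.getProperty("columns", "").trim();
        if (raw.isEmpty()) {
            return new String[0];
        }
        // Split on commas, ignoring whitespace around each name.
        return raw.split("\\s*,\\s*");
    }
}
```

The resulting array can then be handed to Cascading, for example as `new Fields(ColumnConfig.columnNames(props))`, since the `Fields` constructor takes a varargs list of field names. Adding a column in a later release then only means editing the properties file.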
