java - How does Cascading TextDelimited split a log file

Question

I am following the Cascading guide on its website. I have the following input in TSV format:

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]

I use the following code to process it:

Tap docTap = new Hfs(new TextDelimited(true, "\t"), inPath);
...
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

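It seems to split only the second part of each line (ignoring the doc_id part). How does Cascading ignore the first doc_id part and process only the second part? Is it because of TextDelimited?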

score 0 · Accepted Answer

Look at the pipe statement:

Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);

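The second argument selects the only field that is passed to the splitter function. Here you are passing the "text" field, so only the text is sent to the splitter, which returns the tokens.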
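To make this concrete, here is a minimal plain-Java sketch of what the splitter sees. It does not use Cascading; it assumes the split behaves like java.util.regex.Pattern.split with the pattern string from the question. The class name SplitSketch and the method tokens are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: plain Java, not Cascading's own implementation.
final class SplitSketch {
    // Same pattern string as in the question's RegexSplitGenerator.
    static final Pattern DELIMS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Splits only the value it is given; the caller decides which field that is.
    static List<String> tokens(String textValue) {
        return Arrays.asList(DELIMS.split(textValue));
    }

    public static void main(String[] args) {
        // The doc_id value ("doc05") is never handed to tokens(),
        // so it cannot show up in the output.
        String text = "Two Women. Secrets. A Broken Land. [DVD Australia]";
        for (String token : tokens(text)) {
            System.out.println("[" + token + "]");
        }
    }
}
```

In this sketch, adjacent delimiters such as ". " produce empty tokens, so a real flow would probably want to filter those out afterwards.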

The documentation for the Each constructor, quoted below, explains this clearly.

Each

@ConstructorProperties(value={"name","argumentSelector","function","outputSelector"})
public Each(String name,
            Fields argumentSelector,
            Function function,
            Fields outputSelector)

Only pass argumentFields to the given function, only return fields selected by the outputSelector.

Parameters:
    name - name for this branch of Pipes
    argumentSelector - field selector that selects Function arguments from the input Tuple
    function - Function to be applied to each input Tuple
    outputSelector - field selector that selects the output Tuple from the input and Function results Tuples
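The two selectors described above can be sketched in plain Java as follows. This is a toy model under stated assumptions, not Cascading's implementation: tuples are modeled as maps, and the boolean keepInput stands in for the output selector (false roughly corresponds to Fields.RESULTS, true to a selector that also keeps the input fields). The names EachSketch, each, and keepInput are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Illustration only: a toy model of argument and output selection.
final class EachSketch {
    // Passes only the argument field's value to fn and emits one output
    // tuple per result. keepInput=false keeps just the result field;
    // keepInput=true also copies the input fields into each output tuple.
    static List<Map<String, String>> each(Map<String, String> input,
                                          String argumentField,
                                          String resultField,
                                          Function<String, List<String>> fn,
                                          boolean keepInput) {
        List<Map<String, String>> out = new ArrayList<>();
        for (String result : fn.apply(input.get(argumentField))) {
            Map<String, String> tuple = new LinkedHashMap<>();
            if (keepInput) {
                tuple.putAll(input);
            }
            tuple.put(resultField, result);
            out.add(tuple);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("doc_id", "doc01");
        row.put("text", "rain shadow");
        Function<String, List<String>> splitOnSpace = s -> Arrays.asList(s.split(" "));
        // Results only: doc_id and text are dropped.
        System.out.println(each(row, "text", "token", splitOnSpace, false));
        // Input fields kept alongside each token.
        System.out.println(each(row, "text", "token", splitOnSpace, true));
    }
}
```

With keepInput set to false, each output tuple carries only "token", which matches what the question observes with Fields.RESULTS.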

score 0 · Accepted Answer

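The answer lies in these two lines.

1. When the Tap is created, the program is told that the first line contains a header (the "true" argument).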

Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );

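2. In this line, the column name is given as "text". If you look closely at the input file, "text" is the name of the column that holds the data you are trying to run the word count on.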

 Fields text = new Fields( "text" );
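The way a header line lets a column be addressed by name can be sketched in plain Java. This is only an illustration of the idea, assuming a tab-delimited file whose first line holds the column names; it is not how TextDelimited is implemented. HeaderSketch and toNamedTuple are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: pairs header names with the values of a data line.
final class HeaderSketch {
    static Map<String, String> toNamedTuple(String headerLine, String dataLine) {
        String[] names = headerLine.split("\t");
        // Limit -1 keeps trailing empty columns instead of dropping them.
        String[] values = dataLine.split("\t", -1);
        Map<String, String> tuple = new LinkedHashMap<>();
        for (int i = 0; i < names.length && i < values.length; i++) {
            tuple.put(names[i], values[i]);
        }
        return tuple;
    }

    public static void main(String[] args) {
        String header = "doc_id\ttext";
        String line = "doc03\tA rain shadow is an area of dry land.";
        Map<String, String> tuple = toNamedTuple(header, line);
        // Selecting "text" by name leaves doc_id out of further processing.
        System.out.println(tuple.get("text"));
    }
}
```

Once each value is tied to a name from the header, selecting "text" picks the second column without the code ever mentioning its position.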
