I am following the Cascading guide on its website. I have the following TSV-format input:
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
I use the following code to process it:
Tap docTap = new Hfs(new TextDelimited(true, "\t"), inPath);
...
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
// only returns "token"
Pipe docPipe = new Each("token", text, splitter, Fields.RESULTS);
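It appears to split only the second part of each line (the text column) and to ignore the doc_id part. How does Cascading skip the doc_id column and process only the text column? Is it because of TextDelimited?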
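For illustration, here is a minimal plain-Java sketch of the behaviour in question. It does not use Cascading and is not Cascading's implementation. It assumes that `TextDelimited(true, "\t")` takes the field names (`doc_id`, `text`) from the header line, that the second argument of `Each` (`text`) is the argument selector deciding which fields the function sees, and that `Fields.RESULTS` keeps only what the function emits. The class and method names (`ArgumentSelectorSketch`, `parseRow`, `splitSelected`) are hypothetical and exist only for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the tap + Each pipe; NOT Cascading code.
class ArgumentSelectorSketch {

    // Mimics reading one delimited line with a header: each value is
    // stored under the field name found at the same position in the header.
    public static Map<String, String> parseRow(String[] header, String line) {
        String[] values = line.split("\t", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < header.length; i++) {
            row.put(header[i], i < values.length ? values[i] : "");
        }
        return row;
    }

    // Mimics an argument selector: only the named field is handed to the
    // splitting step, so every other field (such as doc_id) is never seen.
    // Returning only the tokens mimics keeping just the function results.
    public static List<String> splitSelected(Map<String, String> row,
                                             String argumentField,
                                             String regex) {
        String argument = row.get(argumentField);
        List<String> tokens = new ArrayList<>();
        for (String token : argument.split(regex)) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] header = "doc_id\ttext".split("\t");
        Map<String, String> row = parseRow(header, "doc01\tA rain shadow");
        // Only the "text" field is split; "doc01" never reaches the splitter.
        System.out.println(splitSelected(row, "text", "[ \\[\\]\\(\\),.]"));
    }
}
```

Under these assumptions, TextDelimited only supplies the field names. The doc_id column is dropped because `text` is selected as the argument and only the results are kept.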