I'm working through the Cascading tutorial on its official website. It uses the following input:
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
It looks like TSV format.
Its WordCount program contains the following code:
Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");
So I'm confused about what "[ \\[\\]\\(\\),.]" means. Does it just grep the second part of each line of the input file and name it the "token" field?
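To poke at the pattern outside Cascading, here is a minimal plain-Java sketch that applies the same regex string to the text of doc05 using java.util.regex. The class name SplitSketch and the helper tokens are hypothetical names made up for this experiment, and dropping empty pieces is my own choice; I am only assuming that RegexSplitGenerator splits on the pattern in a similar way, and it may well emit empty tokens that this sketch hides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of Cascading or the tutorial.
class SplitSketch {
    // Same pattern string as in the tutorial snippet: a character class that
    // matches one space, '[', ']', '(', ')', ',' or '.'.
    static final Pattern DELIMS = Pattern.compile("[ \\[\\]\\(\\),.]");

    // Split the text on every delimiter character and keep the non-empty pieces.
    // Filtering empties is an assumption of this sketch, not Cascading behavior.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String piece : DELIMS.split(text)) {
            if (!piece.isEmpty()) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Text column of doc05 from the input above.
        System.out.println(tokens("Two Women. Secrets. A Broken Land. [DVD Australia]"));
        // prints [Two, Women, Secrets, A, Broken, Land, DVD, Australia]
    }
}
```

In this sketch the regex is not selecting a column at all: it is a set of delimiter characters, and each piece between delimiters becomes one value, which is presumably what ends up in the "token" field.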