cascading - Confusion about Cascading's RegexSplitGenerator

Question

I am working through the Cascading tutorial on its official website. It uses the following input:

doc_id  text
doc01   A rain shadow is a dry area on the lee back side of a mountainous area.
doc02   This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03   A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04   This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05   Two Women. Secrets. A Broken Land. [DVD Australia]

It looks like TSV format.

Its WordCount program contains the following code:

Fields token = new Fields("token");
Fields text = new Fields("text");
RegexSplitGenerator splitter = new RegexSplitGenerator(token, "[ \\[\\]\\(\\),.]");

So I am confused about what "[ \\[\\]\\(\\),.]" means. Does it just grep the second part of each line of the input file and name it the "token" field?

score 1 · Accepted Answer

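To understand what this regex means, let's first look at two things:

1. What are we trying to achieve?
2. What does our data look like?

The answers to these questions are:

1. We are trying to solve the word count problem.

2. The data contains: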

a. words
b. commas ','
c. dots '.'
d. square open and closing brackets '[' and ']'
e. round open and closing brackets '(' and ')'
f. spaces

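So let's go back to question #1, counting words. Before we can count, we have to get rid of the commas, dots, square brackets, round brackets and spaces.

Now we are ready to look at the regex:

Regex:[ \\[\\]\\(\\),.]

Let's break it down: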

\\[   --> splits on (and so drops) every opening square bracket
\\]   --> splits on (and so drops) every closing square bracket
\\(   --> splits on (and so drops) every opening round bracket
\\)   --> splits on (and so drops) every closing round bracket
,     --> splits on (and so drops) every comma
.     --> splits on (and so drops) every dot (inside a character class, a dot is a literal dot)
space --> if you look closely, the regex has a space right after the opening bracket, so it also splits on (and drops) every space in the data.

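Once the regex has been applied, only the words are left, each as its own "token" value, and they can be used for the word count.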
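
To see the effect of the pattern outside Cascading, here is a minimal sketch that uses plain String.split from the Java standard library as a stand-in for the splitter. The class name SplitSketch and the helper tokens are made up for illustration, and this only approximates what RegexSplitGenerator does inside a flow.

```java
import java.util.Arrays;

class SplitSketch {
    // Same pattern as in the tutorial: a single character that is a space,
    // a square bracket, a round bracket, a comma, or a dot.
    static final String DELIMITERS = "[ \\[\\]\\(\\),.]";

    // Hypothetical helper: split one line of text on every delimiter character.
    static String[] tokens(String text) {
        return text.split(DELIMITERS);
    }

    public static void main(String[] args) {
        String doc01 = "A rain shadow is a dry area on the lee back side of a mountainous area.";
        String doc05 = "Two Women. Secrets. A Broken Land. [DVD Australia]";
        System.out.println(Arrays.toString(tokens(doc01)));
        System.out.println(Arrays.toString(tokens(doc05)));
    }
}
```

Because the pattern matches one character at a time, two delimiters in a row (such as the ". " in doc05) leave an empty string between them in plain Java, so a word count built this way may want to filter empty tokens out.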

score 0 · Accepted Answer

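The outer brackets [] define a character class.

The regex matches a single character that is a space, an opening square bracket, a closing square bracket, an opening parenthesis, a closing parenthesis, a comma, or a period.

The backslashes involve two levels of escaping:

Two backslashes in the string literal produce a single backslash in the string.
That single backslash ensures the next character is treated as a literal rather than as a regex character with special meaning.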
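
The two levels can be made visible by printing the string that the regex engine actually receives. This is a small sketch using only the Java standard library; the class name EscapeSketch is made up for illustration.

```java
class EscapeSketch {
    // In Java source, every \\ becomes one backslash in the resulting string.
    static final String PATTERN = "[ \\[\\]\\(\\),.]";

    public static void main(String[] args) {
        // Prints the text handed to the regex engine: [ \[\]\(\),.]
        System.out.println(PATTERN);
        // 13 characters in total: the string holds 4 backslashes, not 8.
        System.out.println(PATTERN.length());
        // The escaped "[" inside the class matches a literal opening square bracket.
        System.out.println("[".matches(PATTERN));
        // An ordinary letter is not one of the listed characters.
        System.out.println("x".matches(PATTERN));
    }
}
```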

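By the way, I'm not sure whether \\(\\) is necessary. I think () should be enough inside a character class, but I could be wrong. Inside a character class, characters are treated as literals, although you still have to take care to escape square brackets so that you don't close the character class prematurely.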
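
The doubt about the parentheses can be checked directly with java.util.regex, assuming Cascading hands the pattern to that engine unchanged. The sketch below compares the tutorial's pattern with a variant that leaves the parentheses unescaped; the class name ParenSketch and the helper sameSplit are hypothetical.

```java
import java.util.Arrays;

class ParenSketch {
    static final String ESCAPED = "[ \\[\\]\\(\\),.]"; // the tutorial's pattern
    static final String UNESCAPED = "[ \\[\\](),.]";   // parentheses left bare

    // Hypothetical helper: true when both patterns split the text identically.
    static boolean sameSplit(String text) {
        return Arrays.equals(text.split(ESCAPED), text.split(UNESCAPED));
    }

    public static void main(String[] args) {
        String doc03 = "A rain shadow is an area of dry land that lies on the "
                     + "leeward (or downwind) side of a mountain.";
        System.out.println(sameSplit(doc03));
    }
}
```

In Java's regex flavor the two patterns behave the same, so the extra backslashes before ( and ) are harmless but not required; the square brackets are the ones that still need escaping.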
