regex - Finding repeated-word typos in text using regular expressions

Question

Is it possible to find all repeated-word typos in a text (in my case, LaTeX source), such as:

... The Lagrangian that that includes this potential ...
... This is confimided by the the theorem of ...

using regular expressions?

Use your favorite tool (sed, grep) or language (Python, Perl, ...).

score 1 · Accepted Answer


Try this:

grep -E '\b(\w+)\s+\1\b'  myfile.txt

answered 2013-01-08T14:25:37.790

score 1 · Accepted Answer

Use egrep -w with a regular expression containing a backreference, (\w+)\s+\1:

$ echo "The Lagrangian that that includes this potential" | egrep -ow "(\w+)\s+\1"
that that

$ echo "This is confimided by the the theorem of" | egrep -ow "(\w+)\s+\1"
the the

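Note: the -o option prints only the matched part of each line, which is useful for demonstrating what actually matched; you will probably want to drop it and use --color instead. The -w option is important for matching whole words; without it, "is is" would match inside "This is con..".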

(\w+) # Matches & captures one or more word characters ([A-Za-z0-9_])
\s+   # Match one or more whitespace characters 
\1    # Backreference: the same text that group 1 captured

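The advantage of using egrep -w --color "(\w+)\s+\1" file is that it clearly highlights the potentially erroneous repeated words. Automatic replacement may be unwise, because many legitimate cases such as "reggae reggae sauce" or "beautiful beautiful day" would be altered.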
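The same "report, don't replace" idea can be sketched in Python. This is only an illustration under assumptions: the function name report_duplicates and the sample lines are made up here, and the pattern adds \b word boundaries to play the role of egrep's -w option.

```python
import re

# A word, whitespace, then the same word again; \b stands in for egrep -w.
DUPLICATE = re.compile(r'\b(\w+)\s+\1\b')

def report_duplicates(lines):
    """Return (line_number, matched_text) pairs; the input is never modified."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in DUPLICATE.finditer(line):
            hits.append((number, match.group(0)))
    return hits

# Hypothetical sample input; the last line is a legitimate repetition.
sample = [
    "The Lagrangian that that includes this potential",
    "This is confimided by the the theorem of",
    "A beautiful beautiful day",
]

for number, text in report_duplicates(sample):
    print(f"{number}: {text}")
```

Every hit is listed for a human to review, so a legitimate repetition such as "beautiful beautiful" is flagged but left untouched.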

score 1 · Accepted Answer

This JavaScript example works:

var s = '... The Lagrangian that that includes this potential ... This is confimided by the the theorem of ...'
var result = s.match(/\b(\w+)\s\1\b/gi)

Result:

["that that", "the the"];

The regular expression:

/\b(\w+)\s\1\b/gi

# /     --> Regex start,
# \b    --> A word boundary,
# (\w+) --> Followed by a word, grouped,
# \s    --> Followed by a space,
# \1    --> Followed by the word in group 1,
# \b    --> Followed by a word boundary,
# /gi   --> End regex, (g)lobal flag, case (i)nsensitive flag.

The word boundaries are added to prevent the regex from matching strings such as "hot hotel" or "nice ice".
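The effect of the word boundaries can be checked with a small Python sketch. The variable names and the test string are invented for this illustration; re.IGNORECASE mirrors the i flag of the JavaScript regex.

```python
import re

# Same pattern with and without the \b anchors.
with_boundaries = re.compile(r'\b(\w+)\s\1\b', re.IGNORECASE)
without_boundaries = re.compile(r'(\w+)\s\1', re.IGNORECASE)

text = "hot hotel, nice ice, the the"

# finditer + group(0) yields the full matched text, not just the captured group.
bounded = [m.group(0) for m in with_boundaries.finditer(text)]
unbounded = [m.group(0) for m in without_boundaries.finditer(text)]

print(bounded)    # ['the the']
print(unbounded)  # ['hot hot', 'ice ice', 'the the']
```

Without the anchors, the pattern matches partial words: "hot hot" inside "hot hotel" and "ice ice" inside "nice ice".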

score 0 · Accepted Answer

An example in Python showing how to remove the repeated words:

In [1]: import re

In [2]: s1 = '... The Lagrangian that that includes this potential ...'

In [3]: s2 = '... This is confimided by the the theorem of ...'

In [4]: regex = r'\b(\w+)\s+\1\b'

In [5]: re.sub(regex, r'\g<1>', s1)
Out[5]: '... The Lagrangian that includes this potential ...'

In [6]: re.sub(regex, r'\g<1>', s2)
Out[6]: '... This is confimided by the theorem of ...'
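In LaTeX source a repeated word can also straddle a line break, which line-oriented tools such as grep do not see. Since \s+ also matches newlines, the same pattern can be applied to the whole text at once. The following is a minimal sketch, assuming the file fits in memory; find_duplicates and the sample string are hypothetical names used only for illustration.

```python
import re

DUPLICATE = re.compile(r'\b(\w+)\s+\1\b')

def find_duplicates(text):
    """Return (line_number, word) for each repeated word, even across line breaks."""
    results = []
    for match in DUPLICATE.finditer(text):
        # The line number is where the first copy of the word starts.
        line_number = text.count("\n", 0, match.start()) + 1
        results.append((line_number, match.group(1)))
    return results

# Hypothetical sample: "that" is repeated across a line break.
source = (
    "The Lagrangian that\n"
    "that includes this potential.\n"
    "This is confirmed by the the theorem."
)

print(find_duplicates(source))  # [(1, 'that'), (3, 'the')]
```

Reporting locations instead of calling re.sub keeps the decision about each repetition with the author.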
