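Is it possible to use a regular expression to find all repeated-word typos in a text (LaTeX source, in my case), such as:
... The Lagrangian that that includes this potential ...
... This is confimided by the the theorem of ...
Answers using your favorite tool (sed, grep) or language (Python, Perl, ...) are welcome.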
Try this:
grep -E '\b(\w+)\s+\1\b' myfile.txt
Use egrep -w with a backreference, as in the regex (\w+)\s+\1:
$ echo "The Lagrangian that that includes this potential" | egrep -ow "(\w+)\s+\1"
that that
$ echo "This is confimided by the the theorem of" | egrep -ow "(\w+)\s+\1"
the the
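Note: the -o option prints only the matching part of each line, which is handy for demonstrating what actually matched; in practice you may want to drop it and use --color instead. The -w option is important for matching whole words; without it, "is is" would match inside "This is con..".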
(\w+) # Matches & captures one or more word characters ([A-Za-z0-9_])
\s+ # Match one or more whitespace characters
\1 # The last captured word
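The effect of whole-word matching can be checked outside grep as well. The following is a minimal sketch in Python, assuming that \b on both sides of the pattern is an acceptable stand-in for grep's -w option; the variable names are illustrative only.

```python
import re

text = "This is confimided by the the theorem of"

# Without word boundaries, the "is" inside "This" pairs with the next "is".
loose = re.compile(r"(\w+)\s+\1")
# With \b on both sides, only whole repeated words match (similar to grep -w).
strict = re.compile(r"\b(\w+)\s+\1\b")

loose_hits = [m.group(0) for m in loose.finditer(text)]
strict_hits = [m.group(0) for m in strict.finditer(text)]

print(loose_hits)   # ['is is', 'the the']
print(strict_hits)  # ['the the']
```

The loose pattern reports a false positive from "This is", while the bounded pattern reports only the genuine repeat.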
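The advantage of egrep -w --color "(\w+)\s+\1" file is that it clearly highlights potentially erroneous repeated words for review. Replacing them automatically may be unwise, because many legitimate repetitions, such as "reggae reggae sauce" or "beautiful beautiful day", would be changed too.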
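One limitation worth noting: grep processes its input line by line, so a repeat that is split across a line break (common in hard-wrapped LaTeX source) is not reported. As a rough sketch, assuming the whole file fits in memory, the same pattern can be applied to the full text in Python, where \s+ also matches newlines. The function name find_repeats and the sample text are hypothetical and only serve the illustration.

```python
import re

# \s+ also matches newlines, so a repeat split across two lines is caught.
REPEAT = re.compile(r"\b(\w+)\s+\1\b")

def find_repeats(text):
    """Return (line_number, matched_words) for each repeated word in text."""
    hits = []
    for m in REPEAT.finditer(text):
        # Line number of the first word of the pair (1-based).
        line_no = text.count("\n", 0, m.start()) + 1
        # Normalize the whitespace between the two words to a single space.
        hits.append((line_no, " ".join(m.group(0).split())))
    return hits

sample = (
    "The Lagrangian that that includes this potential\n"
    "is confimided by the\n"
    "the theorem of\n"
)
for line_no, words in find_repeats(sample):
    print(f"line {line_no}: {words}")
```

Like the --color approach, this only reports candidates and leaves the decision to fix them to the author. For a real file, the text could be obtained with open(path).read().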
This JavaScript example works:
var s = '... The Lagrangian that that includes this potential ... This is confimided by the the theorem of ...'
var result = s.match(/\b(\w+)\s\1\b/gi)
Result:
["that that", "the the"];
The regex:
/\b(\w+)\s\1\b/gi
# / --> Regex start,
# \b --> A word boundary,
# (\w+) --> Followed by a word, grouped,
# \s --> Followed by a space,
# \1 --> Followed by the word in group 1,
# \b --> Followed by a word boundary,
# /gi --> End regex, (g)lobal flag, case (i)nsensitive flag.
The word boundaries are added to prevent the regex from matching strings such as "hot hotel" or "nice ice".
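The case-insensitive flag in the expression above matters mostly at the start of a sentence, where the first copy of the word is capitalized. A small sketch in Python, assuming re.IGNORECASE as the counterpart of the JavaScript i flag; the sample sentence is made up for the illustration.

```python
import re

text = "The the theorem holds. It is nice ice, served in a hot hotel."
pattern = r"\b(\w+)\s\1\b"

# Case-sensitive: "The" and "the" differ, so nothing is reported.
case_sensitive = [m.group(0) for m in re.finditer(pattern, text)]
# Case-insensitive: the backreference also matches a different letter case.
case_insensitive = [
    m.group(0) for m in re.finditer(pattern, text, flags=re.IGNORECASE)
]

print(case_sensitive)    # []
print(case_insensitive)  # ['The the']
```

In both runs the word boundaries keep "nice ice" and "hot hotel" out of the results.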
An example in Python showing how to remove the repeated words:
In [1]: import re
In [2]: s1 = '... The Lagrangian that that includes this potential ...'
In [3]: s2 = '... This is confimided by the the theorem of ...'
In [4]: regex = r'\b(\w+)\s+\1\b'
In [5]: re.sub(regex, r'\g<1>', s1)
Out[5]: '... The Lagrangian that includes this potential ...'
In [6]: re.sub(regex, r'\g<1>', s2)
Out[6]: '... This is confimided by the theorem of ...'
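A single pass with this pattern removes only one extra copy, so a word typed three times in a row still leaves a duplicate behind. One possible variation, offered as a sketch rather than as part of the original answer, repeats the backreference group so that a whole run collapses at once. The names RUN and collapse_repeats are hypothetical.

```python
import re

# (?:\s+\1\b)+ consumes every further copy of the captured word.
RUN = re.compile(r"\b(\w+)(?:\s+\1\b)+")

def collapse_repeats(text):
    """Replace each run of a repeated word with a single copy."""
    return RUN.sub(r"\1", text)

tripled = "by the the the theorem of"

# The pairwise pattern from the session above leaves one duplicate.
single_pass = re.sub(r"\b(\w+)\s+\1\b", r"\1", tripled)
collapsed = collapse_repeats(tripled)

print(single_pass)  # by the the theorem of
print(collapsed)    # by the theorem of
```

As with any automatic replacement, legitimate repetitions would be collapsed too, so it is safer to review the matches first.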