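Interesting question. This can be solved with a single preg_replace() statement, but the length of the repeated phrase must be limited to avoid excessive backtracking. Here is a solution with a commented regex that works for the test data and fixes doubled, tripled (or repeated n times) phrases having a max length of 50 chars:
Solution to part 1: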
$result = preg_replace('/
    # Match a doubled "phrase" having length up to 50 chars.
    (            # $1: Phrase having whitespace boundaries.
      (?<=\s|^)  # Assert phrase preceded by ws or start of string.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $1: Phrase
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \1         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    (?=\s|$)     # Assert last duplicate followed by ws or end of string.
    /x', '$1', $text);
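Because the pattern leans on backtracking, it may be worth guarding the call: preg_replace() returns null when PCRE reports an error such as hitting the backtrack limit. The following is a minimal sketch, not part of the original answer. The wrapper name collapse_repeated_phrases() and the sample text are assumptions made for illustration. The pattern is a condensed form of the one above and ends with a `(?=\s|$)` lookahead so that a repeat must end at whitespace or at the end of the string.

```php
<?php
// Hypothetical wrapper (name chosen for illustration only).
function collapse_repeated_phrases(string $text): string
{
    $result = preg_replace('/
        ( (?<=\s|^) \S .{0,49}? )   # $1: phrase, 50 chars max.
        (?: \s+ \1 ){1,}            # One or more whitespace-separated repeats.
        (?=\s|$)                    # Repeat must end at ws or end of string.
        /x', '$1', $text);

    // preg_replace() returns null on failure (for example when the
    // backtrack limit is exceeded); fall back to the unmodified text.
    if ($result === null) {
        return $text;
    }
    return $result;
}

// Made-up sample text with a doubled word, a tripled phrase and a doubled word.
echo collapse_repeated_phrases('this is is a test a test a test of the the regex'), "\n";
// prints "this is a test of the regex"
```

When the function falls back to the original text, preg_last_error() can be inspected to find out which PCRE error occurred.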
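Note that with this solution, a "phrase" can consist of a single word, and there are legitimate cases where a doubled word is valid grammar and should not be fixed. If that is not the desired behavior, the regex can easily be modified to define a "phrase" as two or more "words".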
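One way the two-or-more-words variant might look is sketched below. This is an assumption about how to make that change, not the original author's code. The function name is hypothetical, and the 50-character cap is only applied loosely here: the first word is unbounded and the remainder of the phrase is capped at 48 characters.

```php
<?php
// Hypothetical helper: only collapses repeated phrases of two or more words.
function collapse_repeated_multiword_phrases(string $text): string
{
    $result = preg_replace('/
        (              # $1: phrase made of two or more words.
          (?<=\s|^)    # Phrase preceded by ws or start of string.
          \S+          # First word of the phrase.
          \s+          # Whitespace between first and second word.
          \S           # First char of the second word.
          .{0,47}?     # Lazily match the rest of the phrase.
        )              # End $1.
        (?:            # Group for one or more duplicate phrases.
          \s+          # Duplicates are separated by whitespace.
          \1           # Match duplicate of phrase.
        ){1,}          # Require one or more duplicates.
        (?=\s|$)       # Last duplicate followed by ws or end of string.
        /x', '$1', $text);

    // Fall back to the input if PCRE reports an error.
    return $result === null ? $text : $result;
}

echo collapse_repeated_multiword_phrases(
    'I saw the dog the dog yesterday and that that was fine'
), "\n";
// prints "I saw the dog yesterday and that that was fine"
```

In this sample the doubled phrase "the dog" is collapsed, while the doubled single word "that that" is left alone.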
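Edit: Modified the above regex to handle any number of phrase repetitions. Also added a solution for the second part of the question below.
Here is a similar solution where the phrase begins with a word of digits, and each repeated phrase must also begin with a word of digits (but the digits word of a repeated phrase need not match the original's):
Solution to part 2: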
$result = preg_replace('/
    # Match doubled "phrases" with wildcard digits first word.
    (            # $1: 1st word of phrase (digits).
      \b         # Anchor 1st phrase word to word boundary.
      \d+        # Phrase 1st word is string of digits.
      \s+        # 1st and 2nd words separated by whitespace.
    )            # End $1: 1st word of phrase (digits).
    (            # $2: Part of phrase after 1st digits word.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $2: Part of phrase after 1st digits word.
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \d+        # Duplicate starts with any digits word.
      \s+        # Digits word followed by whitespace.
      \2         # Match duplicate of rest of phrase.
    ){1,}        # Require one or more duplicate phrases.
    (?=\s|$)     # Assert last duplicate followed by ws or end of string.
    /x', '$1$2', $text);
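To illustrate the part 2 behavior, here is a small sketch that runs a condensed form of the pattern on made-up input. The function name and the sample sentence are assumptions for illustration only. The pattern ends with a `(?=\s|$)` lookahead so that a repeat must end at whitespace or at the end of the string.

```php
<?php
// Hypothetical helper wrapping a condensed form of the part 2 pattern.
function collapse_repeated_numbered_phrases(string $text): string
{
    $result = preg_replace('/
        ( \b \d+ \s+ )            # $1: leading digits word plus whitespace.
        ( \S .{0,49}? )           # $2: rest of the phrase, 50 chars max.
        (?: \s+ \d+ \s+ \2 ){1,}  # Repeats: any digits word, then same rest.
        (?=\s|$)                  # Repeat must end at ws or end of string.
        /x', '$1$2', $text);

    // Fall back to the input if PCRE reports an error.
    return $result === null ? $text : $result;
}

echo collapse_repeated_numbered_phrases(
    'Order 12 apples 7 apples 30 apples and 5 pears'
), "\n";
// prints "Order 12 apples and 5 pears"
```

Only the first digits word ("12") is kept; the digits of the repeats ("7" and "30") are discarded along with the repeated text.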