php - Replace repeating strings in a string

Question

I'm trying to find (and replace) repeated strings in a string.

My string can look like this:

Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

This should become:

Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.

Note that amit sit is not removed, since it is not repeated.

Or the string can look like this:

Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.

which should become:

Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

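So it is not just a-z; the repeated part can also contain other (ASCII) characters. I'd be very happy if someone could help me with this.

The next step would be to match (and replace) something like this:

2 questions 3 questions 4 questions 5 questions

which would become:

2 questions

The number in the final output can be any of the numbers (2, 3, 4, ...); it doesn't matter. In this last example only the numbers differ, while the words are the same.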

score 2 · Accepted Answer

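If it helps, \1, \2, etc. are used to reference earlier capture groups. So, for example, the following (Perl) picks out repeated words and reduces them to a single occurrence: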

$string =~ s/\b(\w+)( \1\b)+/$1/g;

Repeated phrases can be handled in a similar way.
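As a rough PHP sketch of the same backreference idea: the first pattern below collapses repeated words, and the second treats a phrase as a lazily matched run of whitespace-separated tokens. The helper names collapse_repeated_words and collapse_repeated_phrases are made up for illustration and are not from the answer.

```php
<?php
// Hypothetical helper: collapse immediately repeated words ("the the" -> "the").
function collapse_repeated_words($text)
{
    // \1 refers back to whatever the first group captured.
    return preg_replace('/\b(\w+)(?:\s+\1\b)+/', '$1', $text);
}

// Hypothetical helper: collapse immediately repeated phrases of one or more tokens.
function collapse_repeated_phrases($text)
{
    // Tokens are runs of non-whitespace, so "()" counts as a token too.
    return preg_replace('/(?<!\S)(\S+(?:\s+\S+)*?)(?:\s+\1(?!\S))+/', '$1', $text);
}

echo collapse_repeated_words('the the the cat sat sat down'), "\n";
echo collapse_repeated_phrases('dolor sit amet () sit amet () sit amet () sit nostrud'), "\n";
```

Neither pattern caps the phrase length, so on long inputs the phrase version may backtrack heavily.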

score 2 · Accepted Answer

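Interesting question. This can be solved with a single preg_replace() statement, but the length of the repeated phrase must be limited to avoid excessive backtracking. Here is a solution with a commented regex that works for the test data and fixes doubled, tripled (or repeated n times) phrases having a max length of 50 chars:

Solution to part 1: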

$result = preg_replace('/
    # Match a doubled "phrase" having length up to 50 chars.
    (            # $1: Phrase having whitespace boundaries.
      (?<=\s|^)  # Assert phrase preceded by ws or BOL.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $1: Phrase
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \1         # Match duplicate of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1', $text);

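Note that with this solution a "phrase" can consist of a single word, and there are legitimate cases where a doubled word is valid grammar and should not be fixed. If that is not the desired behavior, the regex can easily be modified to define a "phrase" as two or more "words".

Edit: Modified the above regex to handle any number of phrase repetitions. Also added a solution for the second part of the question below.

Here is a similar solution where the phrase begins with a word of digits and each repeated phrase must also begin with a word of digits (but the digits word of a repeated phrase does not need to match the original):

Solution to part 2: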
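One way the two-or-more-words modification might look, as a sketch only: the first word and the whitespace after it are matched explicitly, so a lone doubled word no longer qualifies. The 47-character cap on the tail, the trailing lookahead, and the sample strings are illustrative assumptions, not part of the original answer.

```php
<?php
$pattern = '/
    (              # $1: Phrase of two or more words.
      (?<=\s|^)    # Assert phrase preceded by ws or BOL.
      \S+          # First word of phrase.
      \s+          # Whitespace between first and second word.
      \S           # First char of second word.
      .{0,47}?     # Lazily match rest of phrase (cap is an assumption).
    )              # End $1: Phrase
    (?:            # Group for one or more duplicate phrases.
      \s+          # Duplicate separated by whitespace.
      \1           # Match duplicate of phrase.
      (?=\s|$)     # Assumption: duplicate must end at a word end.
    )+             # Require one or more duplicate phrases.
    /x';

// A doubled single word is left alone; a repeated two-word phrase is collapsed.
echo preg_replace($pattern, '$1', 'that that is fine'), "\n";
echo preg_replace($pattern, '$1', 'dolor sit amet sit amet sit amet sit nostrud'), "\n";
```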

$result = preg_replace('/
    # Match doubled "phrases" with wildcard digits first word.
    (            # $1: 1st word of phrase (digits).
    \b           # Anchor 1st phrase word to word boundary.
    \d+          # Phrase 1st word is string of digits.
    \s+          # 1st and 2nd words separated by whitespace.
    )            # End $1:  1st word of phrase (digits).
    (            # $2: Part of phrase after 1st digits word.
      \S         # First char of phrase is non-whitespace.
      .{0,49}?   # Lazily match phrase (50 chars max).
    )            # End $2: Part of phrase after 1st digits word.
    (?:          # Group for one or more duplicate phrases.
      \s+        # Doubled phrase separated by whitespace.
      \d+        # Duplicate begins with any digits word.
      \s+        # Digits word and rest separated by whitespace.
      \2         # Match duplicate of rest of phrase.
    ){1,}        # Require one or more duplicate phrases.
    /x', '$1$2', $text);
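Because these patterns combine backreferences with lazy quantifiers, it may be worth checking for PCRE failures on long inputs: preg_replace() returns null on error, and preg_last_error() reports the cause (for example PREG_BACKTRACK_LIMIT_ERROR). A minimal sketch, where replace_or_fail is a hypothetical wrapper name and the pattern is only a stand-in:

```php
<?php
// Hypothetical wrapper: run preg_replace() and fail loudly instead of returning null.
function replace_or_fail($pattern, $replacement, $text)
{
    $result = preg_replace($pattern, $replacement, $text);
    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants.
        throw new RuntimeException('PCRE error code: ' . preg_last_error());
    }
    return $result;
}

echo replace_or_fail('/\b(\w+)(?:\s+\1\b)+/', '$1', 'ut aliquip aliquip ex'), "\n";
```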

score 1 · Accepted Answer

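((?:\b|^)[\x20-\x7E]+)(\1)+ will match any repeated string of printable ASCII characters that starts on a word boundary. That means it will match hello hello, but not the double l in hello.

If you want to adjust which characters are matched, you can change and add ranges in the form \x##-\x##\x##-\x## (where ## is a hex value), omitting the -\x## wherever you only want to add a single character.

The only problem I can see is that this somewhat simple approach will pick out legitimately repeated words rather than repeated phrases. If you want to force it to pick out only repeated phrases made up of multiple words, you could use something like ((?:\b|^)[\x20-\x7E]+\s)(\1)+ (note the extra \s).

((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+ is getting close to solving your second problem, but I think I may have thought myself into a corner on that one.

Edit: Just to clarify, you would use it as $string =~ s/((?:\b|^)[\x20-\x7E]+\s)(.*(\1))+/$1/ig in Perl, or the equivalent in PHP.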

score 1 · Accepted Answer

Good old brute force...

It is so ugly that I was tempted to post it as eval(base64_decode(...)), but here it is:

function fixi($str) {
    $a = explode(" ", $str);
    return implode(' ', fix($a));
}

function fix($a) {
    $l = count($a);
    $len = 0;
    for($i=1; $i <= $l/2; $i++) {
        for($j=0; $j <= $l - 2*$i; $j++) {
            $n = 1;
            $found = false;
            while(1) {
                $a1 = array_slice($a, $j, $i);
                $a2 = array_slice($a, $j+$n*$i, $i);
                if ($a1 != $a2)
                    break;
                $found = true;
                $n++;
            }
            if ($found && $n*$i > $len) {
                $len = $n*$i;
                $f_j = $j;
                $f_i = $i;
            }
        }
    }
    if ($len) {
        return array_merge(
            fix(array_slice($a, 0, $f_j)),
            array_slice($a, $f_j, $f_i),
            fix(array_slice($a, $f_j+$len, $l))
        );
    }
    return $a;
}

Punctuation is treated as part of the word, so don't expect miracles.
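To illustrate the caveat: because the input is split on single spaces, a trailing comma makes two otherwise identical words compare as different. The sketch below shows this, along with one possible workaround that compares punctuation-stripped tokens. Both normalize_token and the character list passed to rtrim() are assumptions for illustration, not part of the answer.

```php
<?php
// Splitting on spaces keeps punctuation attached to the preceding word.
$tokens = explode(' ', 'sit amet, sit amet');
var_dump($tokens[1] === $tokens[3]); // compares 'amet,' with 'amet'

// Hypothetical helper: strip common trailing punctuation before comparing.
function normalize_token($token)
{
    return rtrim($token, '.,;:!?');
}

var_dump(normalize_token($tokens[1]) === normalize_token($tokens[3]));
```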

score 1 · Accepted Answer

2 questions 3 questions 4 questions 5 questions

becoming

2 questions

can be solved using:

$string =~ s/(\d+ (.*))( \d+ (\2))+/$1/g;

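It matches a number followed by anything (greedily), and then a series of things that each start with a space, followed by a number, followed by something that matches the earlier "anything". It replaces all of that with the first number-anything pair.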
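Since the question is about PHP, here is how the same substitution might look with preg_replace(), which replaces every match by default (like Perl's /g flag). This is a sketch, and the second sample string is only for illustration.

```php
<?php
// Same pattern as the Perl substitution; \2 refers to the text after the first number.
$pattern = '/(\d+ (.*))( \d+ (\2))+/';

echo preg_replace($pattern, '$1', '2 questions 3 questions 4 questions 5 questions'), "\n";
echo preg_replace($pattern, '$1', 'I have 2 questions 3 questions today'), "\n";
```

The greedy (.*) first grabs the rest of the line and then backtracks until the repeated part fits, so this may be slow on very long lines.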

score 0 · Accepted Answer

Code for the first task:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

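        // Iterate over candidate phrase lengths (in words), longest first, and check every start position for a repeat.
        for ($i = (int) floor($words_count / 2); $i >= 1; $i--) {
            // Stop early enough that the compared block $words[$j + $i .. $j + 2 * $i - 1] stays in bounds.
            for ($j = 0; $j <= ($words_count - 2 * $i); $j++) {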
                $need_remove_item = !$need_remove[$j];
                for ($k = $j; $k < ($j + $i); $k++) {
                    if ($words[$k] != $words[$k + $i]) {
                        $need_remove_item = false;
                        break;
                    }
                }
                if ($need_remove_item) {
                    for ($k = $j; $k < ($j + $i); $k++) {
                        $need_remove[$k] = true;
                    }
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = 'Lorem ipsum dolor sit amet sit amet sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet sit nostrud exercitation amit sit ullamco laboris nisi ut aliquip ex ea commodo consequat.' . '<br>' . '<br>';



    $string = 'Lorem ipsum dolor sit amet () sit amet () sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip aliquip ex ea commodo consequat.';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo 'Lorem ipsum dolor sit amet () sit nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.';

?>

Code for the second task:

<?php

    function split_repeating($string)
    {
        $words = explode(' ', $string);
        $words_count = count($words);

        $need_remove = array();
        for ($i = 0; $i < $words_count; $i++) {
            $need_remove[$i] = false;
        }

        for ($j = 0; $j < ($words_count - 1); $j++) {
            $need_remove_item = !$need_remove[$j];
            for ($k = $j + 1; $k < ($words_count - 1); $k += 2) {
                if ($words[$k] != $words[$k + 2]) {
                    $need_remove_item = false;
                    break;
                }
            }
            if ($need_remove_item) {
                for ($k = $j + 2; $k < $words_count; $k++) {
                    $need_remove[$k] = true;
                }
            }
        }

        $result_string = '';
        for ($i = 0; $i < $words_count; $i++) {
            if (!$need_remove[$i]) {
                $result_string .= ' ' . $words[$i];
            }
        }
        return trim($result_string);
    }



    $string = '2 questions 3 questions 4 questions 5 questions';

    echo $string . '<br>';
    echo split_repeating($string) . '<br>';
    echo '2 questions';

?>

score 0 · Accepted Answer

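Thanks a lot to everyone for answering the question. It helped me a lot! I tried both Ridgerunner's and dtanders's regular expressions, and while they worked on some test strings (with some modifications), I had trouble with other strings.

So I went for the brute-force approach, inspired by Nox :). That way I could combine both problems and still get good performance (even better than with regex, since that is slow in PHP).

For anyone interested, here is the concept code: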

function split_repeating_num($string) {
$words = explode(' ', $string);
$all_words = $words;
$num_words = count($words);
$max_length = 100; //max length of substring to check
$max_words = 4; //maximum number of words in substring 
$found = array();
$current_pos = 0;
$unset = array();
foreach ($words as $key=>$word) {
    //see if this word exist in the next part of the string
    $len = strlen($word);
    if ($len === 0) continue;
    $current_pos += $len + 1; //+1 for the space
    $substr = substr($string, $current_pos, $max_length);
    if (($pos = strpos($substr, $word)) !== false) {
        //found it
        //set pointer words and all_words to same value
        while (key($all_words) < $key ) next($all_words);
        while (key($all_words) > $key ) prev($all_words);
        $next_word = next($all_words);

        while (is_numeric($next_word) || $next_word === '') {
            $next_word = next($all_words);
        }
        // see if it follows the word directly
        if ($word === $next_word) {
            $unset [$key] = 1;
        } elseif ($key + 3 < $num_words) {
            for($i = $max_words; $i > 0; $i --) {
                $x = 0;
                $string_a = '';
                $string_b = '';
                while ($x < $i ) {
                    while (is_numeric($next_word) || $next_word === '' ) {
                        // each() was removed in PHP 8; advance with next() as in the loop above.
                        $next_word = next($all_words);
                    }
                    $x ++;
                    $string_a .= $next_word;
                    $string_b .= $words [key($all_words) + $i];
                }

                if ($string_a === $string_b) {
                    //we have a match
                    for($x = $key; $x < $i + $key; $x ++)
                        $unset [$x] = 1;
                }
            }
        }
    }

}
foreach ($unset as $k=>$v) {
    unset($words [$k]);
}
return implode(' ', $words);

}

There are still some small issues and I do need to test it more, but it seems to do its job.
