php - How to find the most unique strings in an array?

Question

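I have a lot of strings in an array - thousands. I need to compare every string in that array against all the others and find the most unique ones.

You can look at and test my code, but as you can see, comparing just 100 items takes a lot of time (around 160 seconds on localhost, Intel Core i7), and I need to compare thousands of items... Any ideas on how to optimize this code?

I don't need to optimize the first part of the code (generating the data), because I pull the data from somewhere else. I only need to optimize the second part (the comparison). As someone noticed, the script can be optimized by not doing duplicate comparisons (a -> b, b -> a). I know that, but I am still trying to save much more than half of the time. Maybe there is a better function for comparing strings than similar_text, but I have no experience with any other, which is why I am asking here...

Code: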

    <?php

    //set how many strings generate for test
    $number_of_test_strings = 100;


    $strings = array();
    $chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    $size_chars_array = strlen( $chars );


    /*
     * Creating some random strings - data for test
     */

    //just for testing performance
    $creating_test_data_time_start =  microtime();

    //create the random strings and put them into the array
    for ( $i = 0; $i < $number_of_test_strings; $i++ ) {

        //set random string to empty string
        $random_string = '';

        //pick the length once (between 1800 and 2500 chars), then choose each character at random
        $random_string_length = rand( 1800, 2500 );
        for( $j = 0; $j < $random_string_length; $j++ ) {
                $random_string .= $chars[ rand( 0, $size_chars_array - 1 ) ];
        }

        //insert random string in to strings array
        $strings[] = $random_string;

    }

    //just for testing performance
    $creating_test_data_time_end =  microtime();




    /*
     * Comparison itself
     */


    //just for testing performance
    $uniqueness_time_start =  microtime();

    //foreach for all strings in array
    foreach ($strings as $key_first_element => $first_element) {

        //reset of matched value
        $matched = 0;

        //foreach with each first element
        foreach ($strings as $key_second_element => $second_element) {

            // don't compare a string with itself
            if ($key_first_element != $key_second_element) {

                //compare those two strings
                similar_text($first_element, $second_element, $match);

                //add match value to matched
                $matched = ($matched + $match);

            }

        }

        // create average uniqueness for that string
        $uniqueness = ($matched / (count($strings) - 1));

        //store it in array
        $uniqueness_array[$key_first_element] = $uniqueness;

    }

    //sort the array by uniqueness (less match the better)- the best on the beginning
    asort($uniqueness_array);

    //just for testing performance
    $uniqueness_time_end =  microtime();


    //just output performance info
    echo 'Creating of test data: '. (array_sum( explode( ' ' , $creating_test_data_time_end ) ) - array_sum( explode( ' ' , $creating_test_data_time_start ) )) .' s, comparing strings: '. (array_sum( explode( ' ' , $uniqueness_time_end ) ) - array_sum( explode( ' ' , $uniqueness_time_start ) )) .' s<br />';

    $i = 0;
    foreach ($uniqueness_array as $key_string => $uniquness_of_string)
    {

        // output just 10 best results
        if ($i < 10) {
            echo 'Uniqueness of a string with key '.$key_string.' is '.$uniquness_of_string.'<br />';    
            $i++;
        }
        else break;

    }

    ?>

Expected input and output:

    //Expected input array
    $input = array(
        'Today is a great day for skiing and I dont have enough time',
        'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
        'Today is a superior day for skiing and I dont have enough time',
        'Completly different string about nothing'
    );


    //Expected output array - the order is important - the most different strings at the beginning of the array
    $output = array(
        'Completly different string about nothing',
        'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
        'Today is a superior day for skiing and I dont have enough time',
        'Today is a great day for skiing and I dont have enough time'
    );
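
The saving mentioned in the question (not comparing both a -> b and b -> a) can be sketched as follows. This is only an illustration: `averageSimilarity` is a hypothetical helper name, and it assumes that one similar_text() result per pair is good enough. similar_text() can return a different value when its two arguments are swapped, so the averages may differ slightly from those of the original double loop.

```php
<?php
// Hypothetical helper: average similar_text() percentage of each string
// against all the others, visiting every unordered pair only once.
function averageSimilarity(array $strings): array
{
    $strings = array_values($strings);   // re-index so $i / $j are valid keys
    $n = count($strings);
    $totals = array_fill(0, $n, 0.0);

    for ($i = 0; $i < $n; $i++) {
        for ($j = $i + 1; $j < $n; $j++) {
            similar_text($strings[$i], $strings[$j], $percent);
            // credit the same percentage to both members of the pair
            $totals[$i] += $percent;
            $totals[$j] += $percent;
        }
    }

    $averages = [];
    foreach ($totals as $key => $total) {
        $averages[$key] = $n > 1 ? $total / ($n - 1) : 0.0;
    }

    asort($averages);   // lowest average similarity (most unique) first
    return $averages;
}
```

This roughly halves the number of similar_text() calls; the cost of each individual call is unchanged.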

score 1 · Accepted Answer

I really don't think similar_text alone is enough... you can combine it with levenshtein to get the result you want.

    $words = array(
        'Today is a great day for skiing and I dont have enough time',
        'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
        'Today is a superior day for skiing and I dont have enough time',
        'Completly different string about nothing'
    );

    $unique = array_map(function ($v) use($words) {
        return new Word($words, $v);
    }, $words);

Using similar_text

    echo "Uniqness By similar_text\n\n";
    usort($unique, function ($a, $b) {
        $a = $a->getSimilar();
        $b = $b->getSimilar();
        return ($a == $b) ? 0 : (($a < $b) ? - 1 : 1);
    });


    foreach ( $unique as $var ) {
        printf("%s (%s) \n",$var->getWord(),$var->getSimilar());
    }

similar_text output

    Uniqness By similar_text

    Completly different string about nothing (36.363636363636)
    Wednesday is a very good day for skiing and snowboarding and I dont have enough time (75.342465753425)
    Today is a great day for skiing and I dont have enough time (90.909090909091)
    Today is a superior day for skiing and I dont have enough time (90.909090909091)

As you can see, "Today is a great ..." and "Today is a superior ..." get the same score (90.909090909091), so they do not end up in the expected order.

Using levenshtein

    echo "\n\nUniqness By levenshtein\n\n";
    usort($unique, function ($a, $b) {
        $a = $a->getLev();
        $b = $b->getLev();
        return ($a == $b) ? 0 : (($a < $b) ? 1 : - 1);
    });

    foreach ( $unique as $var ) {
        printf("%s (%s) \n", $var->getWord(), $var->getLev());
    }

levenshtein output

    Uniqness By levenshtein

    Completly different string about nothing (63)
    Wednesday is a very good day for skiing and snowboarding and I dont have enough time (63)
    Today is a superior day for skiing and I dont have enough time (45)
    Today is a great day for skiing and I dont have enough time (43)

As you can see, "Today is a superior ..." and "Today is a great ..." have very close levenshtein distances (45 and 43). If they had come out equal, the resulting order might not be accurate.

Combining both to get a simple index

    echo "\n\nUniqness By Simple Index \n\n";
    usort($unique, function ($a, $b) {
        $a = $a->getIndex();
        $b = $b->getIndex();
        return ($a == $b) ? 0 : (($a < $b) ? - 1 : 1);
    });

    foreach ( $unique as $var ) {
        printf("%s (%s) \n", $var->getWord(), $var->getIndex());
    }

Simple index output

    Uniqness By Simple Index

    Completly different string about nothing (0.57720057720058)
    Wednesday is a very good day for skiing and snowboarding and I dont have enough time (1.1959121548163)
    Today is a superior day for skiing and I dont have enough time (2.020202020202)
    Today is a great day for skiing and I dont have enough time (2.1141649048626)

Combining the two handles possible ties better.
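
The index is the highest similar_text percentage divided by the largest levenshtein distance, as computed in the Word constructor. As a quick check, the sketch below recomputes it from the figures printed above. `simpleIndex` is a hypothetical helper name used only for this illustration, and the array keys are abbreviated labels rather than the full strings.

```php
<?php
// Hypothetical helper mirroring the division done in the Word constructor.
function simpleIndex(float $similar, int $lev): float
{
    // A distance of 0 leaves nothing to divide by; treat that case as index 0.
    return $lev > 0 ? $similar / $lev : 0.0;
}

// [highest similar_text percentage, largest levenshtein distance],
// copied from the similar_text and levenshtein outputs above.
$scores = [
    'Completly different ...' => [36.363636363636, 63],
    'Wednesday is a ...'      => [75.342465753425, 63],
    'Today is a superior ...' => [90.909090909091, 45],
    'Today is a great ...'    => [90.909090909091, 43],
];

foreach ($scores as $label => [$similar, $lev]) {
    printf("%s (%s)\n", $label, simpleIndex($similar, $lev));
}
```

The printed values should agree with the simple index output, up to rounding in the last digits. The two "Today ..." strings tie on similar_text but differ on levenshtein, which is what separates them in the combined index.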

Class used

    class Word {
        private $lev = 0;       // largest levenshtein distance to any other string
        private $similar = 0;   // highest similar_text percentage against any other string
        private $index = 0;     // similar / lev; lower means more unique
        private $word;

        function __construct($words, $word) {
            $this->word = $word;
            foreach ( $words as $selected ) {

                // skip the string itself (and exact duplicates of it)
                if ($selected === $word)
                    continue;

                $lev = levenshtein($word, $selected);
                if ($lev > $this->lev)
                    $this->lev = $lev;

                similar_text($word, $selected, $match);
                if ($match > $this->similar)
                    $this->similar = $match;
            }

            // avoid a division by zero when there was nothing to compare against
            if ($this->lev > 0)
                $this->index = $this->similar / $this->lev;
        }

        function getLev() {
            return $this->lev;
        }

        function getSimilar() {
            return $this->similar;
        }

        function getIndex() {
            return $this->index;
        }

        function getWord() {
            return $this->word;
        }
    }
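
One caveat when applying this class to the long strings from the question (1800 to 2500 characters): before PHP 8.0, levenshtein() returns -1 when either argument is longer than 255 characters, which would silently distort the index. A minimal sketch of a guard follows. `safeLevenshtein` is a hypothetical wrapper name, not part of PHP or of the answer's code.

```php
<?php
// Hypothetical wrapper: returns the edit distance, or null when levenshtein()
// reports failure (-1), as it does for over-long strings before PHP 8.0.
function safeLevenshtein(string $a, string $b): ?int
{
    $distance = levenshtein($a, $b);

    return $distance < 0 ? null : $distance;
}

$distance = safeLevenshtein('kitten', 'sitting');
if ($distance === null) {
    echo "levenshtein() could not compare these strings\n";
} else {
    echo "distance: $distance\n";
}
```

A caller could skip the pair or fall back to similar_text alone when null comes back; which of the two is appropriate depends on the data.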

