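I have a lot of strings in an array, thousands of them. I need to compare every string in that array with every other string and find the most unique ones.
You can look at and test my code, but as you can see, comparing just 100 items takes a lot of time (about 160 seconds on localhost, Intel Core i7), and I need to compare thousands of items... Any ideas on how to optimize this code?
I don't need to optimize the first part of the code (generating the data), because I pull the data from elsewhere. I only need to optimize the second part (the comparison). As someone noted, the script can be optimized by not repeating comparisons (a -> b, b -> a). I know that, but I am still trying to save much more than half of the time. Maybe there is a better function for comparing strings than similar_text, but I have no experience with the alternatives, which is why I am asking here...
Code: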
<?php
// how many strings to generate for the test
$number_of_test_strings = 100;
$strings = array();
$chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
$size_chars_array = strlen( $chars );

/*
 * Creating some random strings - data for test
 */
// just for testing performance
$creating_test_data_time_start = microtime( true );
// create the random strings and put them into the array
for ( $i = 0; $i < $number_of_test_strings; $i++ ) {
    // start from an empty string
    $random_string = '';
    // pick the length once per string (between 1800 and 2500 chars);
    // calling rand() in the loop condition would re-roll the limit on every iteration
    $length = rand( 1800, 2500 );
    for ( $j = 0; $j < $length; $j++ ) {
        $random_string .= $chars[ rand( 0, $size_chars_array - 1 ) ];
    }
    // insert the random string into the strings array
    $strings[] = $random_string;
}
// just for testing performance
$creating_test_data_time_end = microtime( true );

/*
 * Comparison itself
 */
// just for testing performance
$uniqueness_time_start = microtime( true );
$uniqueness_array = array();
// loop over all strings in the array
foreach ( $strings as $key_first_element => $first_element ) {
    // reset the accumulated match value
    $matched = 0;
    // compare the first element with every other string
    foreach ( $strings as $key_second_element => $second_element ) {
        // don't compare a string with itself
        if ( $key_first_element != $key_second_element ) {
            // compare the two strings; $match receives the similarity in percent
            similar_text( $first_element, $second_element, $match );
            // add the match value to the running total
            $matched += $match;
        }
    }
    // average similarity of this string to all the others
    $uniqueness = $matched / ( count( $strings ) - 1 );
    // store it in the array
    $uniqueness_array[$key_first_element] = $uniqueness;
}
// sort the array by uniqueness (the lower the match, the better) - the best at the beginning
asort( $uniqueness_array );
// just for testing performance
$uniqueness_time_end = microtime( true );

// just output performance info
echo 'Creating of test data: ' . ( $creating_test_data_time_end - $creating_test_data_time_start ) . ' s, comparing strings: ' . ( $uniqueness_time_end - $uniqueness_time_start ) . ' s<br />';
$i = 0;
foreach ( $uniqueness_array as $key_string => $uniqueness_of_string ) {
    // output just the 10 best results
    if ( $i >= 10 ) {
        break;
    }
    echo 'Uniqueness of a string with key ' . $key_string . ' is ' . $uniqueness_of_string . '<br />';
    $i++;
}
?>
Expected input and output:
// Expected input array
$input = array(
    'Today is a great day for skiing and I dont have enough time',
    'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
    'Today is a superior day for skiing and I dont have enough time',
    'Completly different string about nothing'
);
// Expected output array - the order is important - the most different strings at the beginning of the array
$output = array(
    'Completly different string about nothing',
    'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
    'Today is a superior day for skiing and I dont have enough time',
    'Today is a great day for skiing and I dont have enough time'
);
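
For reference, the "no repeated comparisons" idea mentioned above could be sketched roughly as follows. This is only a sketch under assumptions: the function name `rank_by_uniqueness` is made up for illustration, and it treats `similar_text` as symmetric, which PHP does not strictly guarantee (swapping the arguments can change the result for some inputs). It roughly halves the number of `similar_text` calls but is still quadratic in the number of strings, so it does not solve the problem on its own.

```php
<?php
// Hypothetical helper (the name is illustrative, not from any library):
// returns the strings ordered from most unique to least unique.
// Assumption: similar_text($a, $b) is close enough to similar_text($b, $a)
// that each pair only needs to be compared once.
function rank_by_uniqueness(array $strings): array
{
    $strings = array_values($strings);
    $count = count($strings);
    if ($count < 2) {
        return $strings;
    }

    // running sum of similarity percentages for each string
    $totals = array_fill(0, $count, 0.0);

    // visit each unordered pair (i, j) with i < j exactly once
    for ($i = 0; $i < $count - 1; $i++) {
        for ($j = $i + 1; $j < $count; $j++) {
            similar_text($strings[$i], $strings[$j], $percent);
            // credit the same percentage to both members of the pair
            $totals[$i] += $percent;
            $totals[$j] += $percent;
        }
    }

    // average similarity to all other strings; lower means more unique
    $averages = array();
    foreach ($totals as $key => $total) {
        $averages[$key] = $total / ($count - 1);
    }
    asort($averages);

    // map the sorted keys back to the original strings
    $ranked = array();
    foreach (array_keys($averages) as $key) {
        $ranked[] = $strings[$key];
    }
    return $ranked;
}

$input = array(
    'Today is a great day for skiing and I dont have enough time',
    'Wednesday is a very good day for skiing and snowboarding and I dont have enough time',
    'Today is a superior day for skiing and I dont have enough time',
    'Completly different string about nothing'
);
$ranked = rank_by_uniqueness($input);
echo $ranked[0], "\n";
```

Because the pair is only evaluated in one direction, the averages can differ slightly from the full double loop in the original script, so the ordering of near-identical strings is not guaranteed to match.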