php - What is the best (cheapest) way to CamelCase complex input strings?

Question

I have a large number of real-time incoming phrases that need to be transformed to alpha-only CamelCase, with a new word starting at every space and at every removed (non-alpha) character.

This is what I have come up with so far. Is there a cheaper and faster way to perform the task?

function FoxJourneyLikeACamelsHump(string $string): string {
  $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
  $string = ucwords($string);
  $camelCase = preg_replace('/\s+/', '', $string);
  return $camelCase;
}

// $expected = "ThQuCkBrWnFXJumpsVRThLZyDG";
$string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
$is = FoxJourneyLikeACamelsHump($string);

Results:

Sentences: 100000000
Total time: 40.844197034836 seconds
average: 0.000000408

score 3 · Accepted Answer

Your code is quite efficient. You can still improve with a few tweaks:

Provide the delimiter to ucwords so it does not have to look for \t, \n, etc. This assumes the only whitespace in your input is the plain space, because the first step keeps every [:space:] character. On average this gives a 1% improvement.
You can perform the last step with a non-regex replace on a space (under the same assumption). This gives up to a 20% improvement.

Code:

function FoxJourneyLikeACamelsHump(string $string): string {
    $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
    $string = ucwords($string, ' ');
    $camelCase = str_replace(' ', '', $string);
    return $camelCase;
}

See the timings for the original and improved version on rextester.com.

Note: As you used ucwords, which works on bytes, your code cannot be used reliably for Unicode strings in general. To cover that, you would need to use a function like mb_convert_case:

$string = mb_convert_case($string,  MB_CASE_TITLE);

... but this has a performance impact.
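As a rough illustration, a multibyte-aware variant might look like the sketch below. It assumes the mbstring extension is loaded and the input is valid UTF-8; the name camelCaseUnicode is hypothetical and not part of the question. Note that MB_CASE_TITLE also lowercases the rest of each word, which ucwords does not do.

```php
<?php
// Hypothetical Unicode-aware variant; assumes ext-mbstring and valid UTF-8 input.
// preg_replace returns null on invalid UTF-8; that case is not handled here.
function camelCaseUnicode(string $string): string {
    // Replace every run of non-letter, non-whitespace characters with a space.
    $string = preg_replace('/[^\pL\s]+/u', ' ', $string);
    // Title-case each word; unlike ucwords, this also lowercases the rest of the word.
    $string = mb_convert_case($string, MB_CASE_TITLE, 'UTF-8');
    // Drop all remaining whitespace.
    return preg_replace('/\s+/u', '', $string);
}
```

Whether the extra cost is acceptable depends on how often non-ASCII letters actually occur in your input, so it is worth timing on real data.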

score 2 · Accepted Answer

Benchmarked against 3 alternatives, I believe your method is the fastest. Here are the results from 100,000 iterations (in seconds):

array(4) {
  ["Test1"]=>
  float(0.23144102096558)
  ["Test2"]=>
  float(0.41140103340149)
  ["Test3"]=>
  float(0.31215810775757)
  ["Test4"]=>
  float(0.98423790931702)
}

Where Test1 is yours, Test2 and Test3 are mine, and Test4 is from @RizwanMTuman's answer (with a fix).

I thought using preg_split might give you an opportunity to optimise. This function uses only one regex, which returns an array of just the alphabetic items; ucfirst is then applied to each of them:

function FoxJourneyLikeACamelsHump_2(string $string): string {
    return implode('', array_map(function($word) {
        return ucfirst($word);
    }, preg_split("/[^[:alpha:]]/", $string, -1, PREG_SPLIT_NO_EMPTY)));
}

This can be further optimised by using foreach instead of array_map (see here):

function FoxJourneyLikeACamelsHump_3(string $string): string {
    $validItems = preg_split("/[^[:alpha:]]/u", $string, -1, PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach($validItems as $item) {
        $result .= ucfirst($item);
    }
    return $result;
}

This leads me to speculate that two regexes and one ucwords call are faster than one regex and multiple ucfirst calls.

Full test script:

<?php

// yours
function FoxJourneyLikeACamelsHump_1(string $string): string {
  $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
  $string = ucwords($string);
  $camelCase = preg_replace('/\s+/', '', $string);
  return $camelCase;
}

// mine v1
function FoxJourneyLikeACamelsHump_2(string $string): string {
    return implode('', array_map(function($word) {
        return ucfirst($word);
    }, preg_split("/[^[:alpha:]]/", $string, -1, PREG_SPLIT_NO_EMPTY)));
}

// mine v2
function FoxJourneyLikeACamelsHump_3(string $string): string {
    $validItems = preg_split("/[^[:alpha:]]/u", $string, -1, PREG_SPLIT_NO_EMPTY);
    $result = '';
    foreach($validItems as $item) {
        $result .= ucfirst($item);
    }
    return $result;
}

// Rizwan with a fix
function FoxJourneyLikeACamelsHump_4(string $string): string {
    $re = '/(?:\b|\d+)([a-z])|[\d+ +!.@]/';
    $result = preg_replace_callback($re,function ($matches) {
        return (isset($matches[1]) ? strtoupper($matches[1]) : '');
    },$string);
    return $result;
}


// $expected = "ThQuCkBrWnFXJumpsVRThLZyDG";
$test1 = 0;
$test2 = 0;
$test3 = 0;
$test4 = 0;

$loops = 100000;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_1($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test1 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_2($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test2 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_3($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test3 = $time_end - $time_start;

$time_start = microtime(true);
for($i=0; $i<$loops; $i++) {
    $string = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
    $is = FoxJourneyLikeACamelsHump_4($string);
    if($loops==1) echo $is."\n";
}
$time_end = microtime(true);
$test4 = $time_end - $time_start;

var_dump(array('Test1'=>$test1, 'Test2'=>$test2, 'Test3'=>$test3, 'Test4'=>$test4));
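The four timing loops above differ only in the function they call, so the harness could be condensed. The following is a minimal sketch, not the script that produced the numbers above: the helper name benchmark is hypothetical, it uses hrtime (PHP 7.3+) instead of microtime, and only the question's original function is registered as a stand-in.

```php
<?php
// Hypothetical helper: time $loops calls of $fn on $input and return seconds.
function benchmark(callable $fn, string $input, int $loops): float {
    $start = hrtime(true);               // monotonic clock, in nanoseconds
    for ($i = 0; $i < $loops; $i++) {
        $fn($input);
    }
    return (hrtime(true) - $start) / 1e9;
}

// Stand-in for the functions under test (the question's original version).
function FoxJourneyLikeACamelsHump_1(string $string): string {
    $string = preg_replace("/[^[:alpha:][:space:]]/u", ' ', $string);
    $string = ucwords($string);
    return preg_replace('/\s+/', '', $string);
}

$input = " Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ";
$candidates = ['Test1' => 'FoxJourneyLikeACamelsHump_1']; // add the other variants here

$timings = [];
foreach ($candidates as $label => $fn) {
    $timings[$label] = benchmark($fn, $input, 100000);
}
var_dump($timings);
```

Keeping the loop in one place makes it harder for the individual tests to drift apart, and adding a variant becomes a one-line change.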

score 1 · Accepted Answer

You can try this regex:

(?:\b|\d+)([a-z])|[\d+ +!.@]

UPDATE (Run it here)

The idea above is to show how this can be done with a regex.

The following is a PHP implementation of the above regex. You may compare it with yours, as it does the whole job in a single replace operation:

<?php

$re = '/(?:\b|\d+)([a-z])|[\d+ +!.@]/';
$str = 'Th3 qu!ck br0wn f0x jumps 0v3r th3 l@zy d0g. ';

// Uppercase the captured letter if there is one; otherwise drop the match.
$result = preg_replace_callback($re, function ($matches) {
    return isset($matches[1]) ? strtoupper($matches[1]) : '';
}, $str);

echo $result;

Regex Demo

score 0 · Accepted Answer

Before trying to improve the performance of your code, you first need code that works. You are evidently trying to handle UTF-8 encoded strings (since you added the u modifier to your pattern), but for the string liberté égalité fraternité your code capitalises the words as Liberté égalité Fraternité instead of Liberté Égalité Fraternité, because ucwords (like ucfirst) cannot deal with multibyte characters.
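To see the difference in isolation, here is a small sketch. It assumes the mbstring extension and a UTF-8 encoded source file; mbUcfirst is a hypothetical helper name, not a built-in function.

```php
<?php
// Hypothetical helper: uppercase only the first character, multibyte-aware.
function mbUcfirst(string $word): string {
    return mb_strtoupper(mb_substr($word, 0, 1, 'UTF-8'), 'UTF-8')
         . mb_substr($word, 1, null, 'UTF-8');
}

$word = 'égalité';

// ucfirst works on single bytes, so it cannot uppercase the two-byte "é";
// on recent PHP versions it only changes ASCII letters.
var_dump(ucfirst($word));

// The multibyte-aware helper uppercases the whole first character.
var_dump(mbUcfirst($word));
```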

After trying different approaches (with preg_split and preg_replace_callback), it seems that this preg_match_all version is the fastest:

function FoxJourneyLikeACamelsHumpUPMA(string $string): string {
    preg_match_all('~\pL+~u', $string, $m);
    foreach ($m[0] as &$v) {
        $v = mb_strtoupper(mb_substr($v, 0, 1)) . mb_strtolower(mb_substr($v, 1));
    }
    return implode('', $m[0]);
}

Obviously, it's slower than your initial code, but the two can't really be compared, since yours doesn't work on such input.
