php - How to sort strings in Unicode using a predefined alphabet?

Question

score 3 · Accepted Answer

I've left all of my testing echoes in my code block and merely commented them out in case you wanted to see what is being generated throughout the process.

I took some liberties with your code. I didn't like the function calling the function, and I condensed your lookup array into a string with a leading space. The leading space occupies offset 0, so the real letters start at offset 1, which has the same effect as your indexed array that starts from 1. Converting the lookup from an array to a string means I can use mb_strpos() instead of array_search().
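As a small illustration of the string lookup, here is a sketch using only the first few letters of the alphabet (the shortened alphabet and the variable names are just for this example). It shows that mb_strpos() reports character offsets, whereas plain strpos() reports byte offsets and drifts as soon as a multibyte letter precedes the one being looked up.

```php
<?php
// Shortened, illustrative alphabet: " -,.ȝj" (leading space sits at offset 0).
// ȝ is written as an escape (U+021D) so the byte count is unambiguous.
$alphabet = " -,.\u{021D}j";
$encoding = 'UTF-8';

// Character offset: space, -, comma, ., ȝ come first, so j is at 5.
$char_offset = mb_strpos($alphabet, 'j', 0, $encoding);

// Byte offset: ȝ takes two bytes in UTF-8, so j is at byte 6.
$byte_offset = strpos($alphabet, 'j');

// A letter that is not in the alphabet yields false, not a number.
$missing = mb_strpos($alphabet, 'z', 0, $encoding);

var_dump($char_offset, $byte_offset, $missing); // int(5), int(6), bool(false)
```

Because a missing letter yields false, any comparison of offsets should keep that case in mind.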

The crucial point to fix in your code was in the looping, specifically accessing the letters with [$i]. You see, you cannot treat these multibyte characters as single byte characters -- you must use mb_substr() to access the "whole" letter.
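To make the difference concrete, here is a minimal sketch using one of the keys from the sample data (the variable names are only for illustration). Indexing with [1] returns a single byte of the three-byte letter ṯ, while mb_substr() returns the whole letter.

```php
<?php
// "nṯr" with ṯ written as an escape (U+1E6F, three bytes in UTF-8).
$word = "n\u{1E6F}r";

$byte_length = strlen($word);                   // counts bytes: 5
$char_length = mb_strlen($word, 'UTF-8');       // counts characters: 3

$raw_byte   = $word[1];                         // only the first byte of ṯ
$whole_char = mb_substr($word, 1, 1, 'UTF-8');  // the complete letter ṯ

echo $byte_length, ' ', $char_length, "\n";                  // 5 3
echo bin2hex($raw_byte), ' ', bin2hex($whole_char), "\n";    // e1 e1b9af
```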

Setting default values for $alphabet and $encoding means you don't have to write a second "helper" function to pass all of the necessary data. uksort() will pass its expected two arguments and everything goes ahead smoothly.

One final piece of advice is: mb_ functions are expensive, so always try to return in your code as soon as possible and leave the mb_ functions farther "downscript" whenever logically possible.

Here is my suggested code: (Demo)

function alphabetize_custom($a, $b, $alphabet = " -,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ⸗/()[]<>{}'*#I0123456789&@%", $encoding = 'UTF-8') {
    //echo "\n----\n$a =vs= $b";
    $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
    for ($i = 0; $i < $mb_length; ++$i) {
        //echo "\n";
        $a_char = mb_substr($a, $i, 1, $encoding);
        $b_char = mb_substr($b, $i, 1, $encoding);
        //echo "$a_char -vs- $b_char\n";
        //echo "(" , mb_strlen($a_char, $encoding), " & ", mb_strlen($b_char, $encoding), ")\n";
        if ($a_char === $b_char) {/*echo "identical, continue";*/ continue;}
        if (!mb_strlen($a_char, $encoding)) { /* echo "a is empty -1";*/ return -1;}
        if (!mb_strlen($b_char, $encoding)) { /*echo "b is empty 1";*/ return 1;}
        $a_offset = mb_strpos($alphabet, $a_char, 0, $encoding);
        $b_offset = mb_strpos($alphabet, $b_char, 0, $encoding);
        //echo "[" , $a_offset, " & ", $b_offset, "]\n";
        if ($a_offset == $b_offset) { /*echo "== offsets, continue";*/ continue;}
        if ($a_offset < $b_offset) { /*echo "a offset -1";*/ return -1;}
        //echo "b offset 1";
        return 1;
    }
    //echo "0";
    return 0;
}

$result = [
    "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
    "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
    "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
    "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
    "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
    "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
    "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
];

uksort($result, 'alphabetize_custom');

var_export($result);

Output:

array (
  'n' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
    2 => 'Ka.C.Coptite.urkVIII,177,2',
  ),
  'nwj' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
  ),
  'nfr' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
    1 => 'Ka.C.Coptite.urkVIII,177,2',
  ),
  'nḥḥ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176e',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
    2 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
  'nḫȝḫȝ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176c',
  ),
  'nṯr' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,176b',
    1 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
  'nḏ' => 
  array (
    0 => 'Ka.C.Coptite.urkVIII,177,1',
  ),
)

Just for comparison's sake, I wrote an alternative code block that uses array_search() as your original code does and, not surprisingly, it appears to be more efficient according to the speed tests on 3v4l.org. This is likely due to the removal of four mb_ function calls per loop iteration (two mb_strlen() and two mb_strpos()), which I previously described as "expensive". The following snippet provides the same output.

Code: (Demo)

function alphabetize_custom($a, $b) {
    $alphabet = [' ', '-', ',', '.', 'ȝ', 'j', 'ʿ', 'w', 'b', 'p', 'f', 'm', 'n', 'r', 'h', 'ḥ', 'ḫ', 'ẖ', 's', 'š', 'q', 'k', 'g', 't', 'ṯ', 'd', 'ḏ', '⸗', '/', '(', ')', '[', ']', '<', '>', '{', '}', "'", '*', '#', 'I', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '&', '@', '%'];
    unset($alphabet[0]);  // removes dummy first key, effectively starting the keys from 1
    $encoding = 'UTF-8';

    $mb_length = max(mb_strlen($a, $encoding), mb_strlen($b, $encoding));
    for ($i = 0; $i < $mb_length; ++$i) {
        $a_char = mb_substr($a, $i, 1, $encoding);
        $b_char = mb_substr($b, $i, 1, $encoding);
        if ($a_char === $b_char) continue;

        $a_key = array_search($a_char, $alphabet);
        $b_key = array_search($b_char, $alphabet);
        if ($a_key === $b_key) continue;

        return $a_key - $b_key;
    }
    return 0;
}

$result = [
    "nṯr" => ["Ka.C.Coptite.urkVIII,176b", "Ka.C.Coptite.urkVIII,177,1"],
    "n" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,2"],
    "nḫȝḫȝ" => ["Ka.C.Coptite.urkVIII,176c"],
    "nwj" => ["Ka.C.Coptite.urkVIII,176c"],
    "nfr" => ["Ka.C.Coptite.urkVIII,176c", "Ka.C.Coptite.urkVIII,177,2"],
    "nḥḥ" => ["Ka.C.Coptite.urkVIII,176e", "Ka.C.Coptite.urkVIII,177,1", "Ka.C.Coptite.urkVIII,177,1"],
    "nḏ" => ["Ka.C.Coptite.urkVIII,177,1"]
];

uksort($result, 'alphabetize_custom');

var_export($result);
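If the lookups themselves ever become the bottleneck, one further variation is to flip the alphabet into a character => rank map once and reuse it. This is only a sketch and I have not benchmarked it: make_alphabet_comparator() is a hypothetical helper name, it assumes PHP 7.4+ for mb_str_split(), and it assumes that letters missing from the alphabet should sort after all known letters.

```php
<?php
// Hypothetical helper: returns a comparator closure bound to one alphabet.
function make_alphabet_comparator(string $alphabet, string $encoding = 'UTF-8'): callable
{
    // [offset => letter] flipped into [letter => offset]; built only once.
    $rank = array_flip(mb_str_split($alphabet, 1, $encoding));

    return function (string $a, string $b) use ($rank, $encoding): int {
        $a_chars = mb_str_split($a, 1, $encoding);
        $b_chars = mb_str_split($b, 1, $encoding);
        $shared = min(count($a_chars), count($b_chars));

        for ($i = 0; $i < $shared; ++$i) {
            if ($a_chars[$i] === $b_chars[$i]) {
                continue;
            }
            // Assumption: unknown letters rank after every known letter.
            $a_rank = $rank[$a_chars[$i]] ?? PHP_INT_MAX;
            $b_rank = $rank[$b_chars[$i]] ?? PHP_INT_MAX;
            if ($a_rank !== $b_rank) {
                return $a_rank <=> $b_rank;
            }
        }
        // All shared positions tie: the shorter string sorts first.
        return count($a_chars) <=> count($b_chars);
    };
}

$sample = ['nṯr' => 1, 'n' => 2, 'nfr' => 3, 'nwj' => 4];
uksort($sample, make_alphabet_comparator(" -,.ȝjʿwbpfmnrhḥḫẖsšqkgtṯdḏ"));
print_r(array_keys($sample));
```

The closure keeps the alphabet out of the comparator's argument list, so uksort() can still call it with just the two keys.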

score 0 · Accepted Answer

The charset in the meta tag needs to be UTF-8. That is what the outside world calls it; MySQL calls it utf8mb4.

Inside MySQL, declare the columns you want ordered with COLLATE utf8mb4_unicode_520_ci. With that, MySQL can do the work for you:

SELECT ... ORDER BY col ...
