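Many of the other answers I see here (and in the duplicate questions) basically only work for very specifically formatted data, e.g. a string that is entirely a number, or one that has a fixed-length alphabetic prefix. That isn't going to work in the general case.
It's true that there is no way to implement a 100% general nat-sort in MySQL, because to do so what you really need is a modified comparison function that switches between lexicographic sorting of the strings and numeric sorting if/when it encounters a number. Such code could implement any algorithm you like for recognising and comparing the numeric portions within two strings. Unfortunately, the comparison function in MySQL is internal to its code and cannot be changed by the user.
This leaves a hack of some kind, where you try to create a sort key for your string in which the numeric parts are re-formatted so that the standard lexicographic sort actually sorts them the way you want.
For plain integers up to some maximum number of digits, the obvious solution is to simply left-pad them with zeros so that they are all fixed width. This is the approach taken by the Drupal plugin, and by the solutions of @plalx / @RichardToth. (@Christian has a different and much more complex solution, but it offers no advantages that I can see).
As @tye points out, you can improve on this by prepending a fixed-digit length to each number, rather than simply left-padding it. There is much more you can improve on, though, even given the limitations of what is essentially an awkward hack. Yet there don't seem to be any pre-built solutions out there!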
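For example, what about:
- Plus and minus signs? +10 vs 10 vs -10
- Decimals? 8.2, 8.5, 1.006, .75
- Leading zeros? 020, 030, 00000922
- Thousand separators? "1,001 Dalmatians" vs "1001 Dalmatians"
- Version numbers? MariaDB v10.3.18 vs MariaDB v10.3.3
- Very long numbers? 103,768,276,592,092,364,859,236,487,687,870,234,598.55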
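To make the length-prefix idea concrete before any of the harder cases above are considered, here is a minimal sketch for plain integers only. It is an illustration, not the function presented in this answer: the inline derived table and its column names are made up for the example.

```sql
-- Plain string comparison would put '10' and '100' before '9'.
-- Prefixing each number with its 2-digit length makes the string
-- order agree with the numeric order (keys: '019', '0210', '03100').
SELECT n, CONCAT(LPAD(CHAR_LENGTH(n), 2, '0'), n) AS sort_key
FROM (SELECT '9' AS n UNION ALL SELECT '10' UNION ALL SELECT '100') AS t
ORDER BY sort_key;
```

Unlike zero-padding to a fixed width, the prefix leaves the digits themselves untouched, so the key only grows by a constant amount per number.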
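Extending @tye's method, I've created a fairly compact NatSortKey() stored function that converts an arbitrary string into a nat-sort key. It handles all of the above cases, is reasonably efficient, and preserves a total sort order (no two different strings have sort keys that compare equal). A second parameter can be used to limit the number of numbers processed in each string (e.g. to the first 10 numbers), which can be used to ensure the output fits within a given length.
NOTE: A sort-key string generated with a given value of this second parameter should only be sorted against other strings generated with the same parameter value, or else they might not sort correctly!
You can use it directly in an ORDER BY clause, e.g.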
SELECT myString FROM myTable ORDER BY NatSortKey(myString,0); ### 0 means process all numbers - resulting sort key might be quite long for certain inputs
But for efficient sorting of large tables, it's better to pre-store the sort key in another column (possibly with an index on it):
INSERT INTO myTable (myString,myStringNSK) VALUES (@theStringValue,NatSortKey(@theStringValue,10)), ...
...
SELECT myString FROM myTable ORDER BY myStringNSK;
[Ideally, you'd make this happen automatically by creating the key column as a computed stored column, using something like:
CREATE TABLE myTable (
...
myString varchar(100),
myStringNSK varchar(150) AS (NatSortKey(myString,10)) STORED,
...
KEY (myStringNSK),
...);
But for now neither MySQL nor MariaDB allows stored functions in computed columns, so unfortunately you can't yet do this.]
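One possible workaround, which is not part of the original code and is only a sketch, is to keep the key column up to date with triggers, since triggers are allowed to call stored functions. It assumes the myTable, myString and myStringNSK names used above; the trigger names are hypothetical.

```sql
-- Assumption: myTable already has the columns myString and myStringNSK.
-- Trigger names are made up for this example.
CREATE TRIGGER myTable_nsk_before_insert BEFORE INSERT ON myTable
FOR EACH ROW SET NEW.myStringNSK = NatSortKey(NEW.myString, 10);

CREATE TRIGGER myTable_nsk_before_update BEFORE UPDATE ON myTable
FOR EACH ROW SET NEW.myStringNSK = NatSortKey(NEW.myString, 10);
```

With these in place, an INSERT or UPDATE only needs to supply myString. Both triggers must pass the same second argument, for the reason given in the NOTE above.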
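My function affects the sorting of numbers only. If you want to do other sort-normalization things, such as removing all punctuation, trimming whitespace off each end, or replacing multi-whitespace sequences with single spaces, you could either extend the function, or do it before or after NatSortKey() is applied to your data. (I'd recommend using REGEXP_REPLACE() for this purpose).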
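As a rough sketch of doing the normalization before the key is built, the query below trims leading and trailing spaces and collapses runs of whitespace into a single space. The table and column names are the same illustrative ones used earlier, and which normalization steps you want is an assumption you should adjust.

```sql
-- Normalize whitespace first, then build the nat-sort key from the result.
-- '\\s+' reaches the regex engine as \s+ (one or more whitespace characters).
SELECT myString
FROM myTable
ORDER BY NatSortKey(REGEXP_REPLACE(TRIM(myString), '\\s+', ' '), 0);
```

REGEXP_REPLACE() should be available wherever the REGEXP_INSTR() and REGEXP_SUBSTR() calls that NatSortKey() itself relies on are available.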
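It's also somewhat Anglocentric, in that I assume '.' for the decimal point and ',' for the thousands separator, but it should be easy enough to modify if you want the reverse, or if you want that to be switchable as a parameter.
It could be further improved in other ways; for example, it currently sorts negative numbers by absolute value, so -1 comes before -2, rather than the other way around. There is also no way to specify a DESC sort order for numbers while retaining ASC lexicographical sort for text. Both of these issues can be fixed with a little more work; I'll update the code if/when I get the time.
There are lots of other details to be aware of, including some critical dependencies on the charset and collation that you are using, but I've put them all into a comment block within the SQL code. Please read it carefully before using the function yourself!
So, here's the code. If you find a bug, or have an improvement I haven't mentioned, please let me know in the comments!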
delimiter $$
CREATE DEFINER=CURRENT_USER FUNCTION NatSortKey (s varchar(100), n int) RETURNS varchar(350) DETERMINISTIC
BEGIN
/****
Converts numbers in the input string s into a format such that sorting results in a nat-sort.
Numbers of up to 359 digits (before the decimal point, if one is present) are supported. Sort results are undefined if the input string contains numbers longer than this.
For n>0, only the first n numbers in the input string will be converted for nat-sort (so strings that differ only after the first n numbers will not nat-sort amongst themselves).
Total sort-ordering is preserved, i.e. if s1!=s2, then NatSortKey(s1,n)!=NatSortKey(s2,n), for any given n.
Numbers may contain ',' as a thousands separator, and '.' as a decimal point. To reverse these (as appropriate for some European locales), the code would require modification.
Numbers preceded by '+' sort with numbers not preceded with either a '+' or '-' sign.
Negative numbers (preceded with '-') sort before positive numbers, but are sorted in order of ascending absolute value (so -7 sorts BEFORE -1001).
Numbers with leading zeros sort after the same number with no (or fewer) leading zeros.
Decimal-part-only numbers (like .75) are recognised, provided the decimal point is not immediately preceded by either another '.', or by a letter-type character.
Numbers with thousand separators sort after the same number without them.
Thousand separators are only recognised in numbers with no leading zeros that don't immediately follow a ',', and when they format the number correctly.
(When not recognised as a thousand separator, a ',' will instead be treated as separating two distinct numbers).
Version-number-like sequences consisting of 3 or more numbers separated by '.' are treated as distinct entities, and each component number will be nat-sorted.
The entire entity will sort after any number beginning with the first component (so e.g. 10.2.1 sorts after both 10 and 10.995, but before 11)
Note that the first number component in an entity like this is also permitted to contain thousand separators.
To achieve this, numbers within the input string are prefixed and suffixed according to the following format:
- The number is prefixed by a 2-digit base-36 number representing its length, excluding leading zeros. If there is a decimal point, this length only includes the integer part of the number.
- A 3-character suffix is appended after the number (after the decimals if present).
- The first character is a space, or a '+' sign if the number was preceded by '+'. Any preceding '+' sign is also removed from the front of the number.
- This is followed by a 2-digit base-36 number that encodes the number of leading zeros and whether the number was expressed in comma-separated form (e.g. 1,000,000.25 vs 1000000.25)
- The value of this 2-digit number is: (number of leading zeros)*2 + (1 if comma-separated, 0 otherwise)
- For version number sequences, each component number has the prefix in front of it, and the separating dots are removed.
Then there is a single suffix that consists of a ' ' or '+' character, followed by a pair of base-36 digits for each number component in the sequence.
e.g. here is how some simple sample strings get converted:
'Foo055' --> 'Foo0255 02'
'Absolute zero is around -273 centigrade' --> 'Absolute zero is around -03273 00 centigrade'
'The $1,000,000 prize' --> 'The $071000000 01 prize'
'+99.74 degrees' --> '0299.74+00 degrees'
'I have 0 apples' --> 'I have 00 02 apples'
'.5 is the same value as 0000.5000' --> '00.5 00 is the same value as 00.5000 08'
'MariaDB v10.3.0018' --> 'MariaDB v02100130218 000004'
The restriction to numbers of up to 359 digits comes from the fact that the first character of the base-36 prefix MUST be a decimal digit, and so the highest permitted prefix value is '9Z' or 359 decimal.
The code could be modified to handle longer numbers by increasing the size of (both) the prefix and suffix.
A higher base could also be used (by replacing CONV() with a custom function), provided that the collation you are using sorts the "digits" of the base in the correct order, starting with 0123456789.
However, while the maximum number length may be increased this way, note that the technique this function uses is NOT applicable where strings may contain numbers of unlimited length.
The function definition does not specify the charset or collation to be used for string-type parameters or variables: The default database charset & collation at the time the function is defined will be used.
This is to make the function code more portable. However, there are some important restrictions:
- Collation is important here only when comparing (or storing) the output value from this function, but it MUST order the characters " +0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" in that order for the natural sort to work.
This is true for most collations, but not all of them, e.g. in Lithuanian 'Y' comes before 'J' (according to Wikipedia).
To adapt the function to work with such collations, replace CONV() in the function code with a custom function that emits "digits" above 9 that are characters ordered according to the collation in use.
- For efficiency, the function code uses LENGTH() rather than CHAR_LENGTH() to measure the length of strings that consist only of digits 0-9, '.', and ',' characters.
This works for any single-byte charset, as well as any charset that maps standard ASCII characters to single bytes (such as utf8 or utf8mb4).
If using a charset that maps these characters to multiple bytes (such as, e.g. utf16 or utf32), you MUST replace all instances of LENGTH() in the function definition with CHAR_LENGTH()
Length of the output:
Each number converted adds 5 characters (2 prefix + 3 suffix) to the length of the string. n is the maximum count of numbers to convert;
This parameter is provided as a means to limit the maximum output length (to input length + 5*n).
If you do not require the total-ordering property, you could edit the code to use suffixes of 1 character (space or plus) only; this would reduce the maximum output length for any given n.
Since a string of length L has at most ((L+1) DIV 2) individual numbers in it (every 2nd character a digit), for n<=0 the maximum output length is (inputlength + 5*((inputlength+1) DIV 2))
So for the current input length of 100, the maximum output length is 350.
If changing the input length, the output length must be modified according to the above formula. The DECLARE statements for x,y,r, and suf must also be modified, as the code comments indicate.
****/
DECLARE x,y varchar(100); # need to be same length as input s
DECLARE r varchar(350) DEFAULT ''; # return value: needs to be same length as return type
DECLARE suf varchar(101); # suffix for a number or version string. Must be (((inputlength+1) DIV 2)*2 + 1) chars to support version strings (e.g. '1.2.33.5'), though it's usually just 3 chars. (Max version string e.g. 1.2. ... .5 has ((length of input + 1) DIV 2) numeric components)
DECLARE i,j,k int UNSIGNED;
IF n<=0 THEN SET n := -1; END IF; # n<=0 means "process all numbers"
LOOP
SET i := REGEXP_INSTR(s,'\\d'); # find position of next digit
IF i=0 OR n=0 THEN RETURN CONCAT(r,s); END IF; # no more numbers to process -> we're done
SET n := n-1, suf := ' ';
IF i>1 THEN
IF SUBSTRING(s,i-1,1)='.' AND (i=2 OR SUBSTRING(s,i-2,1) RLIKE '[^.\\p{L}\\p{N}\\p{M}\\x{608}\\x{200C}\\x{200D}\\x{2100}-\\x{214F}\\x{24B6}-\\x{24E9}\\x{1F130}-\\x{1F149}\\x{1F150}-\\x{1F169}\\x{1F170}-\\x{1F189}]') AND (SUBSTRING(s,i) NOT RLIKE '^\\d++\\.\\d') THEN SET i:=i-1; END IF; # Allow decimal number (but not version string) to begin with a '.', provided preceding char is neither another '.', nor a member of the unicode character classes: "Alphabetic", "Letter", "Block=Letterlike Symbols" "Number", "Mark", "Join_Control"
IF i>1 AND SUBSTRING(s,i-1,1)='+' THEN SET suf := '+', j := i-1; ELSE SET j := i; END IF; # move any preceding '+' into the suffix, so equal numbers with and without preceding "+" signs sort together
SET r := CONCAT(r,SUBSTRING(s,1,j-1)); SET s = SUBSTRING(s,i); # add everything before the number to r and strip it from the start of s; preceding '+' is dropped (not included in either r or s)
END IF;
SET x := REGEXP_SUBSTR(s,IF(SUBSTRING(s,1,1) IN ('0','.') OR (SUBSTRING(r,-1)=',' AND suf=' '),'^\\d*+(?:\\.\\d++)*','^(?:[1-9]\\d{0,2}(?:,\\d{3}(?!\\d))++|\\d++)(?:\\.\\d++)*+')); # capture the number + following decimals (including multiple consecutive '.<digits>' sequences)
SET s := SUBSTRING(s,LENGTH(x)+1); # NOTE: LENGTH() can be safely used instead of CHAR_LENGTH() here & below PROVIDED we're using a charset that represents digits, ',' and '.' characters using single bytes (e.g. latin1, utf8)
SET i := INSTR(x,'.');
IF i=0 THEN SET y := ''; ELSE SET y := SUBSTRING(x,i); SET x := SUBSTRING(x,1,i-1); END IF; # move any following decimals into y
SET i := LENGTH(x);
SET x := REPLACE(x,',','');
SET j := LENGTH(x);
SET x := TRIM(LEADING '0' FROM x); # strip leading zeros
SET k := LENGTH(x);
SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294) + IF(i=j,0,1),10,36),2,'0')); # (j-k)*2 + IF(i=j,0,1) = (count of leading zeros)*2 + (1 if there are thousands-separators, 0 otherwise) Note the first term is bounded to <= base-36 'ZY' as it must fit within 2 characters
SET i := LOCATE('.',y,2);
IF i=0 THEN
SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x,y,suf); # k = count of digits in number, bounded to be <= '9Z' base-36
ELSE # encode a version number (like 3.12.707, etc)
SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x); # k = count of digits in number, bounded to be <= '9Z' base-36
WHILE LENGTH(y)>0 AND n!=0 DO
IF i=0 THEN SET x := SUBSTRING(y,2); SET y := ''; ELSE SET x := SUBSTRING(y,2,i-2); SET y := SUBSTRING(y,i); SET i := LOCATE('.',y,2); END IF;
SET j := LENGTH(x);
SET x := TRIM(LEADING '0' FROM x); # strip leading zeros
SET k := LENGTH(x);
SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x); # k = count of digits in number, bounded to be <= '9Z' base-36
SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294),10,36),2,'0')); # (j-k)*2 = (count of leading zeros)*2, bounded to fit within 2 base-36 digits
SET n := n-1;
END WHILE;
SET r := CONCAT(r,y,suf);
END IF;
END LOOP;
END
$$
delimiter ;
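Once the function is installed, a quick sanity check might look like the sketch below. The expected keys quoted in the comments are the ones listed in the function's own comment block; the inline derived table is only there to make the example self-contained.

```sql
-- Keys for two of the sample strings from the comment block
SELECT NatSortKey('Foo055', 0);               -- comment block lists 'Foo0255 02'
SELECT NatSortKey('The $1,000,000 prize', 0); -- comment block lists 'The $071000000 01 prize'

-- Version numbers: 'MariaDB v10.3.3' should be returned before 'MariaDB v10.3.18'
SELECT s
FROM (SELECT 'MariaDB v10.3.18' AS s UNION ALL SELECT 'MariaDB v10.3.3') AS t
ORDER BY NatSortKey(s, 0);

-- Sizes to use if the input length is changed (here 100, giving 350 and 101)
SELECT 100 + 5*((100+1) DIV 2) AS max_output_len,
       ((100+1) DIV 2)*2 + 1   AS suf_len;
```

The last statement just evaluates the formulas from the comment block, so replacing 100 with a different input length gives the values needed for the RETURNS clause and the DECLARE statements for r and suf.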