python - How can I get a "sensible" string sort order with Python?

Question

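I have users (mostly German) who need to pick elements from a long list. I will implement autocomplete, but I also want to show them the elements in the order they expect. I asked a few users to sort some typical strings and found that their orderings are (mostly) consistent. However, this ordering is hard to implement: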

user_expectation(l)                               "        <        @        1        2        10       10abc    A        e        é        E        Z
sorted(l)                                         "        1        10       10abc    2        <        @        A        E        Z        e        é
sorted(l, key=lambda w: w.lower())                "        1        10       10abc    2        <        @        A        e        E        Z        é
ns.natsorted(l)                                   1        2        10       10abc    "        <        @        A        E        Z        e        é
ns.natsorted(l, alg=ns.I)                         1        2        10       10abc    "        <        @        A        E        Z        e        é
ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G)     1        2        10       10abc    "        <        @        A        E        e        Z        é
ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G), en 1        2        10       10abc    <        "        @        A        E        e        é        Z
ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G), de 1        2        10       10abc    <        "        @        A        E        e        é        Z
ns.natsorted(l, alg=ns.LF | ns.G), de             1        2        10       10abc    "        <        @        A        e        E        Z        é

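So:

Special characters come first - the order among them does not really matter, as long as it is consistent
Numbers come next. Numbers are sorted numerically by their leading digits (so ['1', '2', '10'] as in the user_expectation row, rather than the lexicographic ['1', '10', '2'])
Characters (Latin1?)
- unaccented first
- lowercase before uppercase (although this is probably not that important)
- accented / special ones later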
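
The rules above can be sketched as a key function for the built-in sorted, using only the standard library. This is a rough illustration under stated assumptions, not natsort's algorithm and not real locale collation: the names sensible_key and _char_key are hypothetical, only ASCII digits count as numbers, a string's category is decided by its first character, and characters are compared one at a time by (base letter, uppercase?, accented?), which reproduces the e, é, E order of the user_expectation row.

```python
import re
import unicodedata

# Split on runs of ASCII digits; the capture group keeps the digits in the result.
_DIGIT_RUNS = re.compile(r"([0-9]+)")


def _char_key(ch):
    """Per-character key: base letter, then lowercase first, then unaccented first."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0]
    return (base.lower(), base.isupper(), len(decomposed) > 1, ch)


def sensible_key(s):
    """Hypothetical key: special characters, then numbers, then letters."""
    first = s[:1]
    if "0" <= first <= "9":
        category = 1          # starts with a digit
    elif first.isalpha():
        category = 2          # starts with a letter
    else:
        category = 0          # starts with anything else (or is empty)

    parts = []
    for i, chunk in enumerate(_DIGIT_RUNS.split(s)):
        if not chunk:
            continue
        if i % 2 == 1:
            # Odd positions are the captured digit runs: compare them as integers.
            parts.append((0, int(chunk), ()))
        else:
            parts.append((1, 0, tuple(_char_key(c) for c in chunk)))
    return (category, parts)


items = ["Z", "E", "\u00e9", "e", "A", "10abc", "10", "2", "1", "@", "<", '"']
print(sorted(items, key=sensible_key))
```

Because the comparison is done character by character, two words that differ only in case or accents are ordered by their first differing character, which is simpler than the multi-level comparison a real collation library performs.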

Code

# -*- coding: utf-8 -*-

from __future__ import unicode_literals
import natsort as ns
import locale

def custom_print(name, l):
    s = u"{:<50}".format(name)
    for el in l:
        s += u"{:<5}\t".format(el)
    print(u"\t" + s.strip())

l = ['"', "<", "@", "1", "2", "10", "10abc", "A", "e", "é", "E", "Z"]
custom_print("user_expectation(l)", l)
custom_print("sorted(l)", sorted(l))
custom_print("sorted(l, key=lambda w: w.lower())",
             sorted(l, key=lambda w: w.lower()))
custom_print("ns.natsorted(l)", ns.natsorted(l))
custom_print("ns.natsorted(l, alg=ns.I)", ns.natsorted(l, alg=ns.I))
custom_print("ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G)",
             ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G))
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
custom_print("ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G), en",
             ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G))
locale.setlocale(locale.LC_ALL, 'de_DE.UTF-8')
custom_print("ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G), de",
             ns.natsorted(l, alg=ns.LOCALE | ns.LF | ns.G))
custom_print("ns.natsorted(l, alg=ns.LF | ns.G), de",
             ns.natsorted(l, alg=ns.LF | ns.G))

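natsort with the IGNORECASE, LOWERCASEFIRST, LOCALE (de or en) and GROUP flags comes very close. What I don't like is that the special characters come after the numbers. Is there a way to fix that? (Also, LF seems to have no effect.)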

score 1 · Accepted Answer

As of natsort 5.1.0, accented characters should be handled out of the box.

Here is one way to get special characters placed before numbers.

import re
import natsort as ns

def special_chars_first(x):
    '''Ensure special characters are sorted first.'''
    # You can add error handling here if needed.
    # If you need '_' to be considered a special character,
    # use [^0-9A-Za-z] instead of \W (note that this also treats
    # accented letters such as 'é' as special).
    return re.sub(r'^(\W)', r'0\1', x)
    # An alternate, less-hacky solution.
    #if re.match(r'\W', x):
    #    return float('-inf'), x
    #else:
    #    return float('inf'), x

l = ['"', "<", "@", "1", "2", "10", "10abc", "A", "e", "é", "E", "Z"]
print(ns.natsorted(l, key=special_chars_first, alg=ns.G | ns.LF))

Output

['"', '<', '@', '1', '2', '10', '10abc', 'A', 'e', 'é', 'E', 'Z']

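This works by prefixing any string that starts with a non-word character (defined as anything other than a letter, a digit, or '_') with a '0', which guarantees that those strings end up before any other number (and, given how natsort currently works, numbers always come first).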
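
To see what the key function hands to natsort, the substitution can be run on its own with only the standard library. This is a small illustration rather than part of the answer; the extra item "_x" is an assumption added here to show that an underscore counts as a word character and is left alone.

```python
import re


def special_chars_first(x):
    """Same substitution as in the answer: prefix a leading non-word character with '0'."""
    return re.sub(r'^(\W)', r'0\1', x)


l = ['"', "<", "@", "1", "2", "10", "10abc", "A", "e", "\u00e9", "E", "Z", "_x"]
for item in l:
    # Show each original string next to the key that would be used for sorting.
    print(repr(item), "->", repr(special_chars_first(item)))
```

Only the three strings starting with '"', '<' and '@' gain the '0' prefix; digits, letters (including 'é' for Python 3 str patterns) and '_' pass through unchanged.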

score 0 · Accepted Answer

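You can use the key parameter to replace the special characters with a character that sorts before digits (for example, a space).

sorted (or natsorted) will then sort by the modified strings but still return the original strings.

A simplified example that handles only special characters versus digits: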

import re
def replace_special(s):
  # add more characters to regex, as required
  return re.sub('[<@]', ' ', s)

l = ['"', "<", "@", "1", "10", "2", "10abc", "A", "e", "é", "E", "Z"]

sorted(l, key=replace_special)
['<', '@', '"', '1', '10', '10abc', '2', 'A', 'E', 'Z', 'e', 'é']
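
Rather than listing characters one by one, the regex could be widened to a character class. The following is a sketch under the assumption that "special" means anything that is neither a word character nor whitespace; the name replace_all_special is hypothetical and not part of the answer.

```python
import re


def replace_all_special(s):
    # Assumption: every character that is not a word character (\w) and not
    # whitespace (\s) is "special" and gets replaced by a space.
    return re.sub(r'[^\w\s]', ' ', s)


l = ['"', "<", "@", "1", "10", "2", "10abc", "A", "e", "\u00e9", "E", "Z"]
result = sorted(l, key=replace_all_special)
print(result)
```

Here '"', '<' and '@' all map to the same key, so they keep their input order because sorted is stable. The digits are still compared lexicographically ('10' before '2'), and '_' is not replaced because it is a word character.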
