python - 在 Python 中将 unicode 文本规范化为文件名等

Question

是否有任何独立的解决方案可以将国际 unicode 文本标准化为 Python 中的安全 ID 和文件名？

例如My International Text: åäö转向my-international-text-aao

plone.i18n确实做得很好，但不幸的是，它依赖于zope.securityandzope.publisher和其他一些包，使其依赖脆弱。

score 35 · Accepted Answer

你想要做的也被称为“slugify”一个字符串。这是一个可能的解决方案：

import re
from unicodedata import normalize

_punct_re = re.compile(r'[\t !"#$%&\'()*\-/<=>?@\[\\\]^_`{|},.:]+')

def slugify(text, delim=u'-'):
    """Generates an slightly worse ASCII-only slug."""
    result = []
    for word in _punct_re.split(text.lower()):
        word = normalize('NFKD', word).encode('ascii', 'ignore')
        if word:
            result.append(word)
    return unicode(delim.join(result))

用法：

>>> slugify(u'My International Text: åäö')
u'my-international-text-aao'

您还可以更改分隔符：

>>> slugify(u'My International Text: åäö', delim='_')
u'my_international_text_aao'

资料来源： 产生蛞蝓

对于 Python 3： pastebin.com/ft7Yb3KS（感谢@MrPoxipol）。

score 4 · Accepted Answer

解决这个问题的方法是决定允许哪些字符（不同的系统对有效标识符有不同的规则。

一旦你决定了哪些字符是被允许的，写一个allowed()谓词和一个 dict 子类用于str.translate：

def makesafe(text, allowed, substitute=None):
    ''' Remove unallowed characters from text.
        If *substitute* is defined, then replace
        the character with the given substitute.
    '''
    class D(dict):
        def __getitem__(self, key):
            return key if allowed(chr(key)) else substitute
    return text.translate(D())

这个功能非常灵活。它让您可以轻松地指定规则来决定保留哪些文本以及替换或删除哪些文本。

这是一个使用规则的简单示例，“只允许 unicode 类别 L 中的字符”：

import unicodedata

def allowed(character):
    return unicodedata.category(character).startswith('L')

print(makesafe('the*ides&of*march', allowed, '_'))
print(makesafe('the*ides&of*march', allowed))

该代码产生如下安全输出：

the_ides_of_march
theidesofmarch

score 2 · Accepted Answer

以下将从 Unicode 可以分解为组合对的任何字符中删除重音，丢弃它不能丢弃的任何奇怪字符，并删除空格：

# encoding: utf-8
from unicodedata import normalize
import re

original = u'ľ š č ť ž ý á í é'
decomposed = normalize("NFKD", original)
no_accent = ''.join(c for c in decomposed if ord(c)<0x7f)
no_spaces = re.sub(r'\s', '_', no_accent)

print no_spaces
# output: l_s_c_t_z_y_a_i_e

它不会尝试删除文件系统上不允许的字符，但您可以DANGEROUS_CHARS_REGEX从为此链接的文件中窃取。

score 2 · Accepted Answer

我也会在这里提出我自己的（部分）解决方案：

import unicodedata

def deaccent(some_unicode_string):
    return u''.join(c for c in unicodedata.normalize('NFD', some_unicode_string)
               if unicodedata.category(c) != 'Mn')

这并不能满足您的所有要求，但提供了一些方便的方法中包含的一些不错的技巧：unicode.normalise('NFD', some_unicode_string)对 unicode 字符进行分解，例如，它将 'ä' 分解为两个 unicode 代码点U+03B3和U+0308.

另一种方法，unicodedata.category(char)，返回该特定的编码字符类别char。类别Mn包含所有组合重音，因此deaccent从单词中删除所有重音。

但请注意，这只是部分解决方案，它消除了重音。在此之后，您仍然需要某种要允许的字符白名单。

score 0 · Accepted Answer

我会去

https://pypi.python.org/pypi?%3Aaction=search&term=slug

Its hard to come up with a scenario where one of these does not fit your needs.

python - 在 Python 中将 unicode 文本规范化为文件名等

5 回答 5

Related

Reference