python - How to make Django slugify work properly with Unicode strings?

Question

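What can I do to prevent the slugify filter from stripping out non-ASCII alphanumeric characters? (I'm using Django 1.0.2)

cnprog.com has Chinese characters in question URLs, so I looked at their code. They are not using slugify in templates; instead, they call this method in the Question model to get permalinks: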

def get_absolute_url(self):
    return '%s%s' % (reverse('question', args=[self.id]), self.title)

Are they doing any processing (slugifying) on the URLs at all?

score 100 · Accepted Answer

For the askbot Q&A forum I adopted a Python package called unidecode. It works well for Latin-based alphabets and even looks reasonable for Greek:

>>> from unidecode import unidecode
>>> unidecode(u'διακριτικός')
'diakritikos'

It does something odd with Asian languages:

>>> unidecode(u'影師嗎')
'Ying Shi Ma '
>>>

Does this make sense?

In askbot we compute slugs like this:

from unidecode import unidecode
from django.template import defaultfilters
slug = defaultfilters.slugify(unidecode(input_text))

score 24 · Accepted Answer

The Mozilla website team has been working on an implementation: https://github.com/mozilla/unicode-slugify. Sample code is at http://davedash.com/2011/03/24/how-we-slug-at-mozilla/

score 23 · Accepted Answer

With Django >= 1.9, django.utils.text.slugify has an allow_unicode parameter:

>>> slugify("你好 World", allow_unicode=True)
"你好-world"

If you use Django <= 1.8 (which you should not, since April 2018), you can take the code from Django 1.9.
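For readers who cannot upgrade, the behavior of the allow_unicode switch can be approximated with the standard library alone. The following is a minimal sketch modeled on the general approach, not a copy of Django's source; the name slugify_sketch is hypothetical, and details such as safe-string handling are left out.

```python
import re
import unicodedata


def slugify_sketch(value, allow_unicode=False):
    """Hypothetical stand-in that mimics the allow_unicode switch."""
    value = str(value)
    if allow_unicode:
        # Keep non-ASCII letters; only normalize compatibility forms.
        value = unicodedata.normalize("NFKC", value)
    else:
        # Decompose accents, then drop anything that is not ASCII.
        value = (
            unicodedata.normalize("NFKD", value)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
    # Remove characters that are not word characters, whitespace, or hyphens.
    value = re.sub(r"[^\w\s-]", "", value).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


unicode_slug = slugify_sketch("你好 World", allow_unicode=True)
ascii_slug = slugify_sketch("你好 World")
```

With allow_unicode=True the Chinese characters survive in the slug; without it they are dropped and only the ASCII word remains.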

score 15 · Accepted Answer

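Also, the Django version of slugify doesn't use the re.UNICODE flag, so it wouldn't even attempt to understand what \w and \s mean for non-ASCII characters.

This custom version is working well for me: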

import re

def u_slugify(txt):
    """A custom version of slugify that retains non-ASCII characters. The purpose of this
    function in the application is to make URLs more readable in a browser, so there are
    some added heuristics to retain as much of the title meaning as possible while
    excluding characters that are troublesome to read in URLs. For example, question marks
    will be seen in the browser URL as %3F and are therefore unreadable. Although non-ASCII
    characters will also be hex-encoded in the raw URL, most browsers will display them
    as human-readable glyphs in the address bar -- those should be kept in the slug."""
    # Note: re.UNICODE must be passed as flags=...; as the fourth positional
    # argument of re.sub it would be interpreted as count instead.
    txt = txt.strip()  # remove leading and trailing whitespace
    txt = re.sub(r'\s*-\s*', '-', txt, flags=re.UNICODE)  # remove spaces before and after dashes
    txt = re.sub(r'[\s/]', '_', txt, flags=re.UNICODE)  # replace remaining spaces and slashes with underscores
    txt = re.sub(r'(\d):(\d)', r'\1-\2', txt, flags=re.UNICODE)  # replace colons between numbers with dashes
    txt = re.sub(r'"', "'", txt, flags=re.UNICODE)  # replace double quotes with single quotes
    txt = re.sub(r'[?,:!@#~`+=$%^&\\*()\[\]{}<>]', '', txt, flags=re.UNICODE)  # remove some characters altogether
    return txt

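Note the last regex substitution. It is a workaround for a problem with the more robust expression r'\W', which seems to either strip out some non-ASCII characters or incorrectly re-encode them, as illustrated in the following Python interpreter session: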

Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> # Paste in a non-ascii string (simplified Chinese), taken from http://globallives.org/wiki/152/
>>> str = '您認識對全球社區感興趣的中國攝影師嗎'
>>> str
'\xe6\x82\xa8\xe8\xaa\x8d\xe8\xad\x98\xe5\xb0\x8d\xe5\x85\xa8\xe7\x90\x83\xe7\xa4\xbe\xe5\x8d\x80\xe6\x84\x9f\xe8\x88\x88\xe8\xb6\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print str
您認識對全球社區感興趣的中國攝影師嗎
>>> # Substitute all non-word characters with X
>>> re_str = re.sub('\W', 'X', str, re.UNICODE)
>>> re_str
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print re_str
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX?的中國攝影師嗎
>>> # Notice above that it retained the last 7 glyphs, ostensibly because they are word characters
>>> # And where did that question mark come from?
>>> 
>>> 
>>> # Now do the same with only the last three glyphs of the string
>>> str = '影師嗎'
>>> print str
影師嗎
>>> str
'\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> re.sub('\W','X',str,re.U)
'XXXXXXXXX'
>>> re.sub('\W','X',str)
'XXXXXXXXX'
>>> # Huh, now it seems to think those same characters are NOT word characters

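I am not sure what the problem above is, but I'm guessing that it stems from "whatever is classified as alphanumeric in the Unicode character properties database" and how that is implemented. I have heard that Python 3.x puts a high priority on better Unicode handling, so this may already be fixed. Or maybe it is correct Python behavior, and I am misusing Unicode and/or the Chinese language.

For now, a workaround is to avoid character classes and make substitutions based on explicitly defined character sets.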
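One likely contributor to the session above: the signature is re.sub(pattern, repl, string, count=0, flags=0), so a flag passed as the fourth positional argument is taken as count. re.UNICODE has the integer value 32, which matches the 32 X characters in the first output; on a UTF-8 byte string, stopping after 32 bytes cuts a three-byte character in two, which would account for the stray question mark. The sketch below (Python 3, helper names hypothetical) reproduces the cap and contrasts it with passing the flag by keyword.

```python
import re

# The integer value behind re.UNICODE; this is what count receives
# when the flag is passed as the fourth positional argument.
UNICODE_AS_INT = int(re.UNICODE)


def sub_with_flag_as_count(text):
    # Same effect as re.sub(r'\W', 'X', text, re.UNICODE):
    # at most UNICODE_AS_INT replacements are made.
    return re.sub(r"\W", "X", text, count=UNICODE_AS_INT)


def sub_with_flag_keyword(text):
    # The flag is passed as flags=..., so every match is replaced.
    return re.sub(r"\W", "X", text, flags=re.UNICODE)


sample = "!" * 40
capped = sub_with_flag_as_count(sample)   # only the first 32 are replaced
full = sub_with_flag_keyword(sample)      # all 40 are replaced
cjk = sub_with_flag_keyword("影師嗎")      # text string: CJK counts as \w
```

On a Python 3 text string the three Chinese characters are word characters, so \W leaves them alone. The all-X result in the session came from matching against raw bytes rather than decoded text.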

score 9 · Accepted Answer

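I'm afraid Django's definition of a slug implies ASCII, though the Django docs don't state this explicitly. This is the source of the default slugify filter. You can see that the values are converted to ASCII, with the 'ignore' option in case of errors: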

import unicodedata
value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore')
value = unicode(re.sub('[^\w\s-]', '', value).strip().lower())
return mark_safe(re.sub('[-\s]+', '-', value))

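Based on that, I'd guess that cnprog.com is not using the official slugify function. You may wish to adapt the Django snippet above if you want different behavior.

Having said that, the RFC for URLs does state that non-US-ASCII characters (or, more specifically, anything other than alphanumerics and $-_.+!*'()) should be encoded using the %hex notation. If you look at the actual raw GET request that your browser sends (say, using Firebug), you'll see that the Chinese characters are in fact encoded before being sent; the browser just makes it look pretty in the display. I suspect this is why slugify insists on ASCII only, fwiw.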
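The percent-encoding described above can be seen without a browser. This is a small illustration using Python 3's urllib.parse; the variable names are only for the example.

```python
from urllib.parse import quote, unquote

# The glyphs a browser shows in the address bar.
path_segment = "影師嗎"

# What actually travels in the request line: UTF-8 bytes, percent-encoded.
encoded = quote(path_segment)

# Decoding restores the original characters.
decoded = unquote(encoded)

# A question mark inside a path segment is escaped in the same way.
encoded_question_mark = quote("what?")
```

Each Chinese character becomes three %XX escapes, one per UTF-8 byte, which is why raw URLs containing them are long and hard to read.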

score 8 · Accepted Answer

You may want to look at: https://github.com/un33k/django-uuslug

It takes care of both "U"s for you: the U in unique and the U in Unicode.

It will do the job for you hassle-free.

score 4 · Accepted Answer

This is what I use:

http://trac.django-fr.org/browser/site/trunk/djangofr/links/slughifi.py

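SlugHiFi is a wrapper around the regular slugify, with the difference that it replaces national characters with their English-alphabet counterparts.

So instead of "Ą" you get "A", instead of "Ł" you get "L", and so on.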
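The general idea of mapping national characters before slugifying can be sketched as follows. This is not the SlugHiFi code; the mapping table is a deliberately tiny, hypothetical example and the function name is made up. An explicit table matters because a letter such as "Ł" has no Unicode decomposition, so normalization alone would not turn it into "L".

```python
import re

# Hypothetical, intentionally incomplete mapping for illustration only.
CHAR_MAP = str.maketrans({
    "Ą": "A", "ą": "a",
    "Ł": "L", "ł": "l",
    "Ż": "Z", "ż": "z",
})


def slughifi_sketch(value):
    # Replace mapped national characters with ASCII counterparts first.
    value = value.translate(CHAR_MAP)
    # Drop everything that is not an ASCII word character, whitespace, or hyphen.
    value = re.sub(r"[^\w\s-]", "", value, flags=re.ASCII).strip().lower()
    # Collapse runs of whitespace and hyphens into a single hyphen.
    return re.sub(r"[-\s]+", "-", value)


slug = slughifi_sketch("Łąka Żaba")
```

Characters missing from the table are simply removed by the ASCII filter, so a real table needs to cover every alphabet the site expects.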

score 2 · Accepted Answer

I'm interested in allowing only ASCII characters in the slug, which is why I benchmarked some of the available tools on the same string:

Unicode Slugify:

In [5]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o', only_ascii=True)
37.8 µs ± 86.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-glo-la-fdo'

Django Uuslug:

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
35.3 µs ± 303 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-g-lo-la-fd-o'

Awesome Slugify:

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
47.1 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'Paizo-trekho-kai-g-lo-la-fd-o'

Python Slugify:

In [3]: %timeit slugify('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o')
24.6 µs ± 122 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-g-lo-la-fd-o'

django.utils.text.slugify with Unidecode:

In [15]: %timeit slugify(unidecode('Παίζω τρέχω %^&*@# και γ%^(λώ la fd/o'))
36.5 µs ± 89.7 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

'paizo-trekho-kai-glo-la-fdo'
