此外, slugify 的 Django 版本不使用 re.UNICODE 标志,因此它甚至不会尝试理解\w\s
它与非 ascii 字符有关的含义。
这个自定义版本对我来说效果很好:
def u_slugify(txt):
"""A custom version of slugify that retains non-ascii characters. The purpose of this
function in the application is to make URLs more readable in a browser, so there are
some added heuristics to retain as much of the title meaning as possible while
excluding characters that are troublesome to read in URLs. For example, question marks
will be seen in the browser URL as %3F and are thereful unreadable. Although non-ascii
characters will also be hex-encoded in the raw URL, most browsers will display them
as human-readable glyphs in the address bar -- those should be kept in the slug."""
txt = txt.strip() # remove trailing whitespace
txt = re.sub('\s*-\s*','-', txt, re.UNICODE) # remove spaces before and after dashes
txt = re.sub('[\s/]', '_', txt, re.UNICODE) # replace remaining spaces with underscores
txt = re.sub('(\d):(\d)', r'\1-\2', txt, re.UNICODE) # replace colons between numbers with dashes
txt = re.sub('"', "'", txt, re.UNICODE) # replace double quotes with single quotes
txt = re.sub(r'[?,:!@#~`+=$%^&\\*()\[\]{}<>]','',txt, re.UNICODE) # remove some characters altogether
return txt
注意最后一个正则表达式替换。这是更健壮的 expression 问题的一种解决方法r'\W'
,它似乎要么去掉一些非 ascii 字符,要么错误地重新编码它们,如以下 python 解释器会话所示:
Python 2.5.1 (r251:54863, Jun 17 2009, 20:37:34)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> # Paste in a non-ascii string (simplified Chinese), taken from http://globallives.org/wiki/152/
>>> str = '您認識對全球社區感興趣的中國攝影師嗎'
>>> str
'\xe6\x82\xa8\xe8\xaa\x8d\xe8\xad\x98\xe5\xb0\x8d\xe5\x85\xa8\xe7\x90\x83\xe7\xa4\xbe\xe5\x8d\x80\xe6\x84\x9f\xe8\x88\x88\xe8\xb6\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print str
您認識對全球社區感興趣的中國攝影師嗎
>>> # Substitute all non-word characters with X
>>> re_str = re.sub('\W', 'X', str, re.UNICODE)
>>> re_str
'XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX\xa3\xe7\x9a\x84\xe4\xb8\xad\xe5\x9c\x8b\xe6\x94\x9d\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> print re_str
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX?的中國攝影師嗎
>>> # Notice above that it retained the last 7 glyphs, ostensibly because they are word characters
>>> # And where did that question mark come from?
>>>
>>>
>>> # Now do the same with only the last three glyphs of the string
>>> str = '影師嗎'
>>> print str
影師嗎
>>> str
'\xe5\xbd\xb1\xe5\xb8\xab\xe5\x97\x8e'
>>> re.sub('\W','X',str,re.U)
'XXXXXXXXX'
>>> re.sub('\W','X',str)
'XXXXXXXXX'
>>> # Huh, now it seems to think those same characters are NOT word characters
我不确定上面的问题是什么,但我猜它源于“在 Unicode 字符属性数据库中被归类为字母数字的任何内容”以及它是如何实现的。我听说 python 3.x 对更好的 unicode 处理有很高的优先级,所以这可能已经修复了。或者,也许这是正确的 python 行为,我在滥用 unicode 和/或中文。
目前,一种解决方法是避免使用字符类,并根据明确定义的字符集进行替换。