python - Remove Unicode character modifiers

Question

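What is the simplest way to strip character modifiers (combining marks) from a Unicode string in Python?

For example:

A͋͠r͍̞̫̜͌ͦ̈́͐ͅt̼̭͞h́u̡̙̞̘̙̬͖͓rͬͣ̐ͮͥͨ̀͏̣ should become Arthur

I went through the documentation but couldn't find anything that does this.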

score 6 · Accepted Answer

Try this:

import unicodedata
a = u"STRING GOES HERE" # using an actual string would break stackoverflow's code formatting.
u"".join( x for x in a if not unicodedata.category(x).startswith("M") )

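This removes every character whose Unicode general category is a mark (the categories beginning with "M"), which I think is what you want. In general, you can look up a character's category with unicodedata.category.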
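As a rough illustration of the category lookup described above, here is a minimal sketch. The helper name strip_marks and the sample string are assumptions made for this example, not part of the original answer.

```python
import unicodedata

def strip_marks(text):
    """Drop every code point whose general category starts with "M" (Mn, Mc, Me)."""
    return "".join(ch for ch in text if not unicodedata.category(ch).startswith("M"))

# Hypothetical sample: "e" followed by U+0301 COMBINING ACUTE ACCENT.
sample = "Cafe\u0301"

print(unicodedata.category("e"))       # Ll (lowercase letter)
print(unicodedata.category("\u0301"))  # Mn (nonspacing mark)
print(strip_marks(sample))             # Cafe
```

Checking the first letter of the category covers nonspacing (Mn), spacing combining (Mc), and enclosing (Me) marks in one test.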

score 5 · Accepted Answer

You can also use r'\p{M}', which the regex module supports:

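import regex  # third-party module (pip install regex); the standard re module does not support \p{M}

def remove_marks(text):
    # r"..." replaces the Python 2-only ur"..." prefix, which is a syntax error in Python 3
    return regex.sub(r"\p{M}+", "", text)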

Example:

>>> print(s)
A͋͠r͍̞̫̜t̼̭͞h́u̡̙̞̘rͬͣ̐ͮ
>>> def remove_marks(text):
...     return regex.sub(r"\p{M}+", "", text)
...
>>> print(remove_marks(s))
Arthur

Depending on your use case, a whitelist approach may work better, for example restricting the input to ASCII characters:

>>> s.encode('ascii', 'ignore').decode('ascii')
'Arthur'

The result may depend on the Unicode normalization form used in the text.
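To make the normalization caveat concrete: a precomposed character such as U+00E9 is a single letter code point, not a letter plus a mark, so a mark filter leaves it untouched and an ASCII whitelist drops it entirely. The sketch below assumes you want the base letter kept; the helper name strip_marks_normalized is hypothetical. It decomposes to NFD first, which is one possible approach rather than the answer's own method.

```python
import unicodedata

def strip_marks_normalized(text):
    """Decompose to NFD, drop mark code points, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )
    return unicodedata.normalize("NFC", stripped)

# Hypothetical sample: precomposed U+00E9 LATIN SMALL LETTER E WITH ACUTE.
precomposed = "caf\u00e9"

print(unicodedata.category("\u00e9"))                         # Ll, so a mark filter alone keeps it
print(precomposed.encode("ascii", "ignore").decode("ascii"))  # caf (the whole letter is lost)
print(strip_marks_normalized(precomposed))                    # cafe
```

Use NFKD instead of NFD if compatibility characters (ligatures, full-width forms) should also be decomposed; whether that is desirable depends on the input.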
