unicodedata.category(unichr)
将返回该代码点的类别。
您可以在unicode.org上找到类别的描述,但与您相关的是L、N、P、Z以及可能S组:
Lu Uppercase_Letter an uppercase letter
Ll Lowercase_Letter a lowercase letter
Lt Titlecase_Letter a digraphic character, with first part uppercase
Lm Modifier_Letter a modifier letter
Lo Other_Letter other letters, including syllables and ideographs
...
您可能还想首先规范化您的字符串,以便可以附加到字母的变音符号这样做:
unicodedata.normalize(form, unistr)
返回 Unicode 字符串 unistr 的范式形式。form 的有效值为“NFC”、“NFKC”、“NFD”和“NFKD”。
把所有这些放在一起:
file_bytes = ... # However you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
'Ll', 'Lu', 'Lt', 'Lm', 'Lo', # Letters
'Nd', 'Nl', # Digits
'Po', 'Ps', 'Pe', 'Pi', 'Pf', # Punctuation
'Zs' # Breaking spaces
])
filtered_text = ''.join(
[ch for ch in normalized_text
if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8') # ready to be written to a file