python - Stripping non-letter characters from Unicode text

Question

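I'm trying to write a simple Python script that takes a text file as input, deletes every non-letter character, and writes the output to another file. Normally I would do it in one of two ways: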

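Use a regular expression with re.sub to replace every non-letter character with an empty string
Examine every character in every line and write it to the output only if it is in string.lowercase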

But this time the text is the Divine Comedy in Italian (I'm Italian), so it contains some Unicode characters such as

èéï

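and some others. I wrote # -*- coding: utf-8 -*- as the first line of the script, but all that achieved is that Python no longer signals an error when Unicode characters appear inside the script.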

Then I tried to include Unicode characters in my regular expression, writing them as, for example:

u'\u00AB'

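It seems to work, but when Python reads input from the file, it does not write the text back out the same way it read it. For example, some characters are converted into the square root symbol.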

What should I do?

score 2 · Accepted Answer

unicodedata.category(unichr) returns the category of that code point.

You can find a description of the categories at unicode.org, but the ones relevant to you are the L, N, P, Z and possibly S groups:

Lu    Uppercase_Letter    an uppercase letter
Ll    Lowercase_Letter    a lowercase letter
Lt    Titlecase_Letter    a digraphic character, with first part uppercase
Lm    Modifier_Letter    a modifier letter
Lo    Other_Letter    other letters, including syllables and ideographs
...
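As a quick illustration of how these categories come out in practice, here is a minimal sketch. It assumes Python 3, where every str is already Unicode (the answer's unichr is Python 2 terminology), and the sample characters are chosen for illustration rather than taken from the poem.

```python
import unicodedata

# Illustrative sample characters (an assumption, not from the original text):
# 'e', U+00C8 (E with grave), '1', space, U+00AB (left guillemet), newline.
samples = ['e', '\u00c8', '1', ' ', '\u00ab', '\n']

# Map each character to its two-letter Unicode general category.
categories = {ch: unicodedata.category(ch) for ch in samples}

for ch, cat in categories.items():
    # ascii() escapes non-ASCII characters, so the output does not
    # depend on the console encoding.
    print(ascii(ch), cat)
```

Note that the newline falls in category Cc (control), so a filter built only from the L, N, P and Z groups will also remove line breaks unless they are allowed explicitly.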

You may also want to normalize your string first, so that diacritics that can attach to letters do so:

unicodedata.normalize(form, unistr)

Return the normal form form for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
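To see why normalizing first matters for a category filter, consider this small sketch (Python 3 assumed; the sample string is made up for illustration). In decomposed form an accented letter is two code points, and the accent has category Mn, which a letters-only filter would drop, leaving a bare 'e'.

```python
import unicodedata

# 'e' followed by U+0300 COMBINING GRAVE ACCENT: two code points.
decomposed = 'e\u0300'

# NFC composes the pair into the single code point U+00E8.
composed = unicodedata.normalize('NFC', decomposed)

print(len(decomposed), len(composed))
print([unicodedata.category(ch) for ch in decomposed])
print([unicodedata.category(ch) for ch in composed])
```

After NFC the accented letter is a single Ll code point, so it survives the filter intact.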

Putting all of this together:

import unicodedata

file_bytes = ...   # However you read your input
file_text = file_bytes.decode('UTF-8')
normalized_text = unicodedata.normalize('NFC', file_text)
allowed_categories = set([
    'Ll', 'Lu', 'Lt', 'Lm', 'Lo',  # Letters
    'Nd', 'Nl',                    # Decimal digits and letter numbers
    'Po', 'Ps', 'Pe', 'Pi', 'Pf',  # Punctuation
    'Zs'                           # Space separators
])
filtered_text = ''.join(
    [ch for ch in normalized_text
     if unicodedata.category(ch) in allowed_categories])
filtered_bytes = filtered_text.encode('UTF-8')  # ready to be written to a file
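For the letters-only, file-to-file case described in the question, the same idea could be sketched as follows. This is not the answer's code: it assumes Python 3, assumes both files are UTF-8, and the names strip_non_letters and filter_file are hypothetical helpers introduced here. Line breaks are kept on purpose, since they would otherwise be filtered out along with everything else.

```python
import unicodedata


def strip_non_letters(text, keep='\n'):
    """Keep letters (categories starting with 'L') plus the characters in `keep`."""
    # Compose accents first so 'e' + combining accent becomes one letter.
    normalized = unicodedata.normalize('NFC', text)
    return ''.join(
        ch for ch in normalized
        if ch in keep or unicodedata.category(ch).startswith('L'))


def filter_file(src_path, dst_path):
    """Copy src_path to dst_path, keeping only letters and line breaks.

    Both paths are hypothetical, and UTF-8 is an assumption about the input.
    """
    with open(src_path, encoding='utf-8') as src, \
            open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(strip_non_letters(line))
```

Opening the files with an explicit encoding means the decode and encode steps happen inside open, so the script never depends on the platform's default encoding.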

score 0 · Accepted Answer

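import codecs

# codecs.open decodes the file, so each line is a unicode string (Python 2)
f = codecs.open('FILENAME', encoding='utf-8')
for line in f:
    print repr(line)  # 1. escaped Unicode form
    print line        # 2. the text itself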

1. will give you the Unicode form
2. will give you the text as written in your file.
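On Python 3 the built-in open accepts an encoding argument, so codecs.open is not needed. A rough equivalent, under that assumption and with read_lines as a hypothetical helper name, might look like this:

```python
def read_lines(path, encoding='utf-8'):
    """Return (escaped_form, text) pairs, one per line of the file.

    `path` is a placeholder and UTF-8 is an assumption about the file.
    """
    pairs = []
    with open(path, encoding=encoding) as f:
        for line in f:
            # ascii() gives the escaped form (1); `line` is the decoded text (2).
            pairs.append((ascii(line), line))
    return pairs
```

The escaped form contains only ASCII, so it prints the same on any console, which makes it handy for checking what was actually read.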

Hope it helps you :)
