python - Cleaning text files with Python regular expressions

Question

I have a huge file with lines like this:

“ En général un tr猫s bon hotel La terrasse du bar pr猫s du lobby ”

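How can I remove these Chinese characters from the lines in the file, so that I get a new file in which the lines contain Roman alphabet characters only? I am thinking of using regular expressions. Is there a character class for all Roman alphabet characters, e.g. Arabic numerals, a-zA-Z and others (punctuation)?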

score 4 · Accepted Answer

I find this regex cheat sheet very handy in situations like this.

# -*- coding: utf-8
import re
import string

u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
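# Python 2: without re.UNICODE, \w, \s and \d match ASCII characters only,
# so this class matches everything that is not ASCII text or punctuation.
p = re.compile(r"[^\w\s\d{}]".format(re.escape(string.punctuation)))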
for m in p.finditer(u):
    print m.group()

>>> 茅
>>> 茅
>>> 猫
>>> 猫
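
The snippet above only prints the offending characters, and it relies on Python 2 regex semantics. A minimal Python 3 sketch of the same idea that actually removes them might look like this; the names `NOT_ALLOWED` and `keep_ascii_text` are made up for illustration, and the `re.ASCII` flag is needed because `\w` and `\s` match Unicode characters by default in Python 3:

```python
import re
import string

# ASSUMPTION: Python 3. re.ASCII restricts \w and \s to ASCII, which mirrors
# the default behaviour the Python 2 snippet above depends on.
NOT_ALLOWED = re.compile(
    r"[^\w\s{}]".format(re.escape(string.punctuation)),
    re.ASCII,
)

def keep_ascii_text(text):
    """Drop every character that is not an ASCII word character,
    ASCII whitespace, or ASCII punctuation."""
    return NOT_ALLOWED.sub("", text)

print(keep_ascii_text("En.!?+ 123 g茅n茅ral un tr猫s bon hotel"))
```

Note that `\w` also covers digits and the underscore, so the `\d` in the original pattern is not required.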

I'm also a big fan of the unidecode module.

from unidecode import unidecode

u = u"En.!?+ 123 g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"

print unidecode(u)

>>> En.!?+ 123 gMao nMao ral un trMao s bon hotel La terrasse du bar prMao s du lobby

score 3 · Accepted Answer

You can make use of the string module.

>>> string.ascii_letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.digits
'0123456789'
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
>>>

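Also, the characters you want to replace appear to be Chinese. If all your strings are unicode, you can use the simple range [\u4e00-\u9fa5] to replace them. This is not the whole range of Chinese characters, but it is enough.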

>>> s = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
>>> s
u'En g\u8305n\u8305ral un tr\u732bs bon hotel La terrasse du bar pr\u732bs du lobby'
>>> import re
>>> re.sub(ur'[\u4e00-\u9fa5]', '', s)
u'En gnral un trs bon hotel La terrasse du bar prs du lobby'
>>>
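
Since the question is about a huge file, the same substitution can be applied line by line so the whole file never has to sit in memory. The following is a rough Python 3 sketch (the `ur''` prefix used above does not exist there); `strip_cjk`, `clean_lines`, `clean_file` and the UTF-8 default are assumptions made for illustration, not part of the answer:

```python
import re

# Same range as in the answer above: common CJK ideographs only,
# not every Chinese character.
CJK = re.compile(r"[\u4e00-\u9fa5]")

def strip_cjk(line):
    """Remove characters in the range U+4E00..U+9FA5 from one line."""
    return CJK.sub("", line)

def clean_lines(lines):
    """Lazily yield cleaned lines from any iterable of strings."""
    for line in lines:
        yield strip_cjk(line)

def clean_file(src_path, dst_path, encoding="utf-8"):
    """Stream src_path into dst_path, one line at a time.
    ASSUMPTION: both files use the same encoding (UTF-8 by default)."""
    with open(src_path, encoding=encoding) as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(clean_lines(src))
```

Calling `clean_file("input.txt", "output.txt")` with hypothetical file names would write the cleaned copy next to the original.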

score 1 · Accepted Answer

You can do this without regular expressions.

To keep only ASCII characters:

# -*- coding: utf-8 -*-
import unicodedata

unistr = u"En g茅n茅ral un tr猫s bon hotel La terrasse du bar pr猫s du lobby"
unistr = unicodedata.normalize('NFD', unistr) # to preserve `e` in `é`
ascii_bytes = unistr.encode('ascii', 'ignore')

To remove everything except ASCII letters, digits, punctuation (and whitespace):

from string import ascii_letters, digits, punctuation, whitespace

to_keep = set(map(ord, ascii_letters + digits + punctuation + whitespace))
all_bytes = range(0x100)
to_remove = bytearray(b for b in all_bytes if b not in to_keep)
text = ascii_bytes.translate(None, to_remove).decode()
# -> En gnral un trs bon hotel La terrasse du bar prs du lobby
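
To see what the NFD step buys, here is a small Python 3 sketch of the first snippet wrapped in a function; `to_ascii` is a hypothetical helper name. Decomposing first splits an accented letter into its base letter plus a combining mark, so only the mark is discarded by `encode('ascii', 'ignore')`, while characters with no ASCII base (such as the Chinese ones) are dropped entirely:

```python
import unicodedata

def to_ascii(text):
    """Decompose accented letters, then drop everything non-ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    # 'ignore' silently discards combining marks and other non-ASCII characters.
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(to_ascii("En général"))  # accent removed, base letter kept
print(to_ascii("tr猫s bon"))    # Chinese character removed entirely
```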
