python - Python 3.9.2 can't remove the funny character “½”

Question

# Below are the steps that remove the funny character “½” in Python 2.6.6; this worked well.

#-*- coding: utf-8 -*- 

import os,glob

funny=glob.glob('C:\A\Text\*')   #This folder has 10 files, so i use '*' for a loop

for h in funny:
    with open(r'%s' %h, 'r') as infile,open(r'%sN' %h, 'w') as outfile:
        data = infile.read()
        data = data.replace ("13½","13")
        data = data.decode("ascii", "ignore")
        outfile.write(data)
        infile.close()
        outfile.close()
        os.remove(h)
        os.rename(r'%sN' %h,r'%s' %h)

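But now that we have upgraded to version 3.9.2, this no longer works. It shows the following error message:

Traceback (most recent call last):
  File "C:/A/test.py", line 10, in <module>
    data = infile.read()
  File "C:\Program Files\Python39\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10871: character maps to <undefined>

I have searched a lot, but the “½” replacement does not work in the new version. Any ideas?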

score 1 · Accepted Answer

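Python 3 needs to know the encoding of the input file. Judging from the traceback, it defaults to cp1252 on your system, but that is evidently not correct. I can't find an encoding in which this byte actually maps to the glyph in your question; see https://tripleee.github.io/8bit/#9d for a list of the encodings supported by Python 3.6.8 (disclosure: my own resource). (Not much should have changed in 3.9.)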
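To see why the traceback complains about byte 0x9d, a quick check along these lines may help. It is only a sketch: the helper name `undefined_bytes` is made up for illustration, and it simply tries to decode every single byte value with the given codec.

```python
def undefined_bytes(encoding):
    """Return the byte values that the given single-byte codec cannot decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


# cp1252 leaves a handful of byte values unassigned, 0x9d among them.
print([hex(b) for b in undefined_bytes("cp1252")])
# Latin-1 assigns a character to every byte value, so nothing is reported.
print(undefined_bytes("latin-1"))
```

Any file containing one of the unassigned bytes will fail to decode as cp1252, no matter what the rest of the file looks like.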

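Wanting to discard data you don't know how to handle is usually just a desperate workaround. The proper solution is to understand what the data represents and then either fix the error at the source, if it really is an error, or handle the data correctly instead of deleting it.

But here is a fix for your code.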
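One way to start understanding the data is to look at the raw bytes around the offending value before deciding what to do with it. The following is a minimal sketch, not something from the question: `show_context` is a hypothetical helper, and the sample bytes are invented for the demonstration.

```python
def show_context(data, target=b"\x9d", width=20):
    """Return the bytes surrounding each occurrence of target in data."""
    snippets = []
    start = data.find(target)
    while start != -1:
        lo = max(0, start - width)
        snippets.append(data[lo:start + len(target) + width])
        start = data.find(target, start + 1)
    return snippets


# Invented sample; with a real file you would pass
# open(path, "rb").read() instead.
sample = b"size 13\x9d inches"
for snippet in show_context(sample):
    print(snippet)
```

Reading in binary mode sidesteps decoding entirely, so this works even when the correct encoding is still unknown.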

import glob
import os

for h in glob.glob(r'C:\A\Text\*'):
    dest = '%sN' % h
    with open(h, 'r', encoding='latin-1') as infile, open(dest, 'w', encoding='latin-1') as outfile:
        for line in infile:
            line = line.replace("13\x9d", "13")
            outfile.write(line)
    os.remove(h)
    os.rename(dest, h)

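The Latin-1 encoding is probably not exactly right here, but as long as you use the same encoding for reading and for writing, and every character code is defined in that encoding (as they conveniently all are in Latin-1), the result should be what you expect.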
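The round-trip property that this relies on can be checked directly. This is just an illustrative sketch; the variable names are arbitrary.

```python
every_byte = bytes(range(256))

# Decoding as Latin-1 never fails: each byte becomes the code point
# with the same number.
text = every_byte.decode("latin-1")

# Encoding back with the same codec restores the original bytes exactly.
round_tripped = text.encode("latin-1")
print(round_tripped == every_byte)

# The byte from the traceback shows up as U+009D, which is why the fix
# matches it with "\x9d".
print(text[0x9d] == "\x9d")
```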

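I also refactored the code to read one line at a time rather than slurp the entire file into memory. If you have enough RAM it doesn't matter, but if you might have large files, this should also improve robustness. If the files are not really text files, you may want to roll back that change (but then you will probably have different problems anyway).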
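If the files turn out not to be text, a byte-level variant might look like the following. This is a sketch under stated assumptions rather than part of the fix above: `strip_bytes` is a hypothetical helper, it reads the whole file into memory, and the demonstration runs on a temporary file instead of the real folder.

```python
import os
import tempfile


def strip_bytes(path, old=b"13\x9d", new=b"13"):
    """Rewrite path in place, replacing old with new at the byte level."""
    with open(path, "rb") as infile:
        data = infile.read()
    dest = path + "N"  # same temporary-name convention as the code above
    with open(dest, "wb") as outfile:
        outfile.write(data.replace(old, new))
    os.replace(dest, path)  # overwrites path, also on Windows


# Demonstration on a throwaway file rather than the real folder.
with tempfile.TemporaryDirectory() as folder:
    demo = os.path.join(folder, "demo.txt")
    with open(demo, "wb") as f:
        f.write(b"size 13\x9d inches\r\n")
    strip_bytes(demo)
    with open(demo, "rb") as f:
        result = f.read()
    print(result)
```

Working on bytes avoids choosing an encoding at all and leaves line endings untouched, at the cost of no longer streaming line by line.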
