python - How to determine the encoding of text?

Question

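I received some text that is encoded, but I don't know what charset was used. Is there a way to determine the encoding of a text file using Python? A related question, "How can I detect the encoding/codepage of a text file", deals with C#.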

score 255 · Accepted Answer

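EDIT: chardet seems to be unmaintained, but most of this answer still applies. See https://pypi.org/project/charset-normalizer/ for an alternative.

Correctly detecting the encoding every time is impossible.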
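To see why, consider the short sketch below. It is only an illustration, and the byte string and the list of candidate encodings are made up for the example: the same three bytes decode without error under several single-byte encodings, each time to different letters, so the bytes alone cannot say which reading was intended.

```python
raw = b"\xe4\xf6\xfc"  # three bytes of unknown origin (hypothetical sample)

# Each of these single-byte encodings maps all three bytes to some letter,
# so every decode below succeeds and none of them can be ruled out.
candidates = ("latin_1", "cp1251", "cp1253", "koi8_r")
readings = {name: raw.decode(name) for name in candidates}

for name, text in readings.items():
    print(name, repr(text))

# UTF-8 is stricter: this particular byte sequence is not valid UTF-8.
try:
    raw.decode("utf_8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid UTF-8:", valid_utf8)
```

A strict encoding such as UTF-8 can sometimes be excluded, but choosing among the permissive ones needs outside knowledge or statistics about the language.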

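(From the chardet FAQ:)

However, some encodings are optimized for specific languages, and languages are not random. Some character sequences pop up all the time, while other sequences make no sense. A person fluent in English who opens a newspaper and finds “txzqJv 2!dasd0a QqdKjvz” will instantly recognize that that isn't English (even though it is composed entirely of English letters). By studying lots of “typical” text, a computer algorithm can simulate this kind of fluency and make an educated guess about a text's language.

The chardet library uses that kind of study to try to detect the encoding. chardet is a port of the auto-detection code in Mozilla.

You can also use UnicodeDammit. It will try the following methods, in order:

An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
An encoding sniffed by the chardet library, if you have it installed.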
UTF-8
Windows-1252
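The tail end of that fallback chain can be approximated with the standard library alone. The following is a minimal sketch, not Beautiful Soup's actual implementation: the function name guess_decode is hypothetical, it only looks at byte order marks, and it skips the in-document declaration and chardet steps entirely.

```python
import codecs

# BOMs ordered so that UTF-32 is tested before UTF-16:
# the UTF-32-LE BOM (ff fe 00 00) begins with the UTF-16-LE BOM (ff fe).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_decode(data):
    """Return (text, encoding_name) using a BOM -> UTF-8 -> Windows-1252 chain."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            # These codecs consume the BOM and use it to pick the byte order.
            return data.decode(name), name
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Last resort: bytes undefined in cp1252 become U+FFFD instead of raising.
        return data.decode("cp1252", errors="replace"), "cp1252"
```

Because the last step never fails, the returned name is only a guess: any byte string that is not valid UTF-8 will be reported as cp1252, whatever it really is.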

score 86 · Accepted Answer

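Another option for working out the encoding is to use libmagic (which is the code behind the file command). There is a profusion of python bindings available.

The python bindings that live in the file source tree are available as the python-magic (or python3-magic) debian package. It can determine the encoding of a file by doing: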

import magic

blob = open('unknown-file', 'rb').read()
m = magic.open(magic.MAGIC_MIME_ENCODING)
m.load()
encoding = m.buffer(blob)  # "utf-8" "us-ascii" etc

There is an identically named, but incompatible, python-magic pip package on pypi that also uses libmagic. It can also get the encoding, by doing:

import magic

blob = open('unknown-file', 'rb').read()
m = magic.Magic(mime_encoding=True)
encoding = m.from_buffer(blob)

score 38 · Accepted Answer

Some encoding strategies, please uncomment to taste:

#!/bin/bash
#
tmpfile=$1
echo '-- info about file file ........'
file -i "$tmpfile"
enca -g "$tmpfile"
echo 'recoding ........'
#iconv -f iso-8859-2 -t utf-8 back_test.xml > "$tmpfile"
#enca -x utf-8 "$tmpfile"
#enca -g "$tmpfile"
recode CP1250..UTF-8 "$tmpfile"

You might like to check the encoding by opening and reading the file in a loop... but you might need to check the file size first:

# PYTHON
import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252'] # add more
for e in encodings:
    try:
        fh = codecs.open('file.txt', 'r', encoding=e)
        # reading everything forces a full decode with this encoding
        fh.readlines()
        # rewind so fh can be reused once an encoding works
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break
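If the file size is a concern, one possible variation (a sketch, not part of the answer above) is to validate the file in fixed-size chunks with an incremental decoder, so that only one chunk is held in memory at a time. The function name first_matching_encoding and the default chunk size are assumptions made for this example.

```python
import codecs


def first_matching_encoding(path,
                            candidates=("utf-8", "windows-1250", "windows-1252"),
                            chunk_size=65536):
    """Return the first candidate that decodes the whole file, or None."""
    for name in candidates:
        # An incremental decoder keeps partial multi-byte sequences between
        # chunks, so a character split across a chunk boundary is not an error.
        decoder = codecs.getincrementaldecoder(name)(errors="strict")
        try:
            with open(path, "rb") as fh:
                for chunk in iter(lambda: fh.read(chunk_size), b""):
                    decoder.decode(chunk)
                # final=True reports a sequence left incomplete at end of file.
                decoder.decode(b"", final=True)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

As with the loop above, the order of the candidates matters: permissive single-byte encodings accept almost any input, so the strict ones such as UTF-8 should come first.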

score 32 · Accepted Answer

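Here is an example of reading a file and taking chardet's encoding prediction at face value. Only the first n_lines of the file are read, in case the file is large.

chardet also gives you the confidence (a probability) of its encoding prediction, returned under the 'confidence' key of the dictionary that chardet.detect() produces, so you could work that in somehow if you like.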

def predict_encoding(file_path, n_lines=20):
    '''Predict a file's encoding using chardet'''
    import chardet

    # Open the file as binary data
    with open(file_path, 'rb') as f:
        # Join binary lines for specified number of lines
        rawdata = b''.join([f.readline() for _ in range(n_lines)])

    return chardet.detect(rawdata)['encoding']

score 11 · Accepted Answer

This might be helpful:

from bs4 import UnicodeDammit
with open('automate_data/billboard.csv', 'rb') as file:
   content = file.read()

suggestion = UnicodeDammit(content)
suggestion.original_encoding
#'iso-8859-1'

score 6 · Accepted Answer

# Function: OpenRead(file)

# A text file can be encoded using:
#   (1) The default operating system code page, Or
#   (2) utf8 with a BOM header
#
#  If a text file is encoded with utf8, and does not have a BOM header,
#  the user can manually add a BOM header to the text file
#  using a text editor such as notepad++, and rerun the python script,
#  otherwise the file is read as a codepage file with the 
#  invalid codepage characters removed

import sys
if sys.version_info[0] != 3:
    print('Aborted: Python 3.x required')
    sys.exit(1)

def bomType(file):
    """
    returns file encoding string for open() function

    EXAMPLE:
        bom = bomType(file)
        open(file, encoding=bom, errors='ignore')
    """

    f = open(file, 'rb')
    b = f.read(4)
    f.close()

    # Check UTF-32 first: its little-endian BOM starts with the UTF-16 one
    if ((b[0:4] == b'\xff\xfe\x00\x00')
              or (b[0:4] == b'\x00\x00\xfe\xff')):
        return "utf32"

    # "utf-8-sig" strips the BOM while decoding; plain "utf8" would keep it
    if (b[0:3] == b'\xef\xbb\xbf'):
        return "utf-8-sig"

    # Python automatically detects endianness if utf-16 bom is present
    # write endianness generally determined by endianness of CPU
    if ((b[0:2] == b'\xfe\xff') or (b[0:2] == b'\xff\xfe')):
        return "utf16"

    # If BOM is not provided, then assume its the codepage
    #     used by your operating system
    return "cp1252"
    # For the United States its: cp1252


def OpenRead(file):
    bom = bomType(file)
    return open(file, 'r', encoding=bom, errors='ignore')


#######################
# Testing it
#######################
fout = open("myfile1.txt", "w", encoding="cp1252")
fout.write("* hi there (cp1252)")
fout.close()

# "utf-8-sig" writes the BOM header that bomType() looks for
fout = open("myfile2.txt", "w", encoding="utf-8-sig")
fout.write("\u2022 hi there (utf8)")
fout.close()

# this case is still treated like codepage cp1252
#   (User responsible for making sure that all utf8 files
#   have a BOM header)
fout = open("badboy.txt", "wb")
fout.write(b"hi there.  barf(\x81\x8D\x90\x9D)")
fout.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile1.txt")
L = fin.readline()
print(L)
fin.close()

# Read Example file with Bom Detection
fin = OpenRead("myfile2.txt")
L =fin.readline() 
print(L) #requires QtConsole to view, Cmd.exe is cp1252
fin.close()

# Read CP1252 with a few undefined chars without barfing
fin = OpenRead("badboy.txt")
L =fin.readline() 
print(L)
fin.close()

# Check that bad characters are still in badboy codepage file
fin = open("badboy.txt", "rb")
print(fin.read(20))
fin.close()

score 2 · Accepted Answer

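It is, in principle, impossible to determine the encoding of a text file in the general case. So no, there is no standard Python library to do that for you.

If you have more specific knowledge about the text file (e.g. that it is XML), there might be library functions.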
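One concrete case of such format-specific knowledge (chosen here only as an illustration) is Python source code, which may declare its own encoding with a BOM or a coding comment. The standard library's tokenize.detect_encoding reads that declaration; the wrapper name python_source_encoding below is hypothetical.

```python
import io
import tokenize


def python_source_encoding(source_bytes):
    """Return the encoding a Python source file declares (default: utf-8)."""
    # detect_encoding wants a readline callable that yields bytes; it inspects
    # at most the first two lines for a BOM or a "coding:" comment.
    encoding, _first_lines = tokenize.detect_encoding(
        io.BytesIO(source_bytes).readline
    )
    return encoding


print(python_source_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n"))
print(python_source_encoding(b"x = 1\n"))
```

This works only because the file format defines where the declaration lives; nothing comparable exists for arbitrary plain text.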

score 2 · Accepted Answer

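Depending on your platform, I just opt to use the linux shell file command. This works for me since I am using it in a script that runs exclusively on one of our linux machines.

Obviously this isn't an ideal solution or answer, but it could be modified to fit your needs. In my case I just need to determine whether a file is UTF-8 or not.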

import subprocess

def is_utf8(path='test.txt'):
    file_cmd = ['file', path]
    # universal_newlines=True makes stdout yield str instead of bytes
    p = subprocess.Popen(file_cmd, stdout=subprocess.PIPE, universal_newlines=True)
    cmd_output = p.stdout.readlines()
    p.wait()
    # x will begin with the file type output as is observed using 'file' command
    x = cmd_output[0].split(": ")[1]
    return x.startswith('UTF-8')

score 2 · Accepted Answer

If you are not satisfied with the automatic tools, you can try all codecs and see manually which one is right.

all_codecs = ['ascii', 'big5', 'big5hkscs', 'cp037', 'cp273', 'cp424', 'cp437', 
'cp500', 'cp720', 'cp737', 'cp775', 'cp850', 'cp852', 'cp855', 'cp856', 'cp857', 
'cp858', 'cp860', 'cp861', 'cp862', 'cp863', 'cp864', 'cp865', 'cp866', 'cp869', 
'cp874', 'cp875', 'cp932', 'cp949', 'cp950', 'cp1006', 'cp1026', 'cp1125', 
'cp1140', 'cp1250', 'cp1251', 'cp1252', 'cp1253', 'cp1254', 'cp1255', 'cp1256', 
'cp1257', 'cp1258', 'euc_jp', 'euc_jis_2004', 'euc_jisx0213', 'euc_kr', 
'gb2312', 'gbk', 'gb18030', 'hz', 'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2', 
'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext', 'iso2022_kr', 'latin_1', 
'iso8859_2', 'iso8859_3', 'iso8859_4', 'iso8859_5', 'iso8859_6', 'iso8859_7', 
'iso8859_8', 'iso8859_9', 'iso8859_10', 'iso8859_11', 'iso8859_13', 
'iso8859_14', 'iso8859_15', 'iso8859_16', 'johab', 'koi8_r', 'koi8_t', 'koi8_u', 
'kz1048', 'mac_cyrillic', 'mac_greek', 'mac_iceland', 'mac_latin2', 'mac_roman', 
'mac_turkish', 'ptcp154', 'shift_jis', 'shift_jis_2004', 'shift_jisx0213', 
'utf_32', 'utf_32_be', 'utf_32_le', 'utf_16', 'utf_16_be', 'utf_16_le', 'utf_7', 
'utf_8', 'utf_8_sig']

def find_codec(text):
    for i in all_codecs:
        for j in all_codecs:
            try:
                print(i, "to", j, text.encode(i).decode(j))
            except Exception:
                pass

find_codec("The example string which includes ö, ü, or ÄŸ, Ã¶")

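This script tries 9409 codec pairs (97 × 97) and prints one line for every pair that does not raise an error, so the output can run to thousands of lines. If it does not fit on the terminal screen, try writing the output to a text file.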
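If you already know a fragment that the correct text should contain, the search can filter itself instead of printing everything. The following is a sketch under that assumption; the function name repair_candidates and its parameters are hypothetical.

```python
def repair_candidates(garbled, expected, codec_names):
    """List (encode_with, decode_with) pairs whose round trip contains `expected`."""
    matches = []
    for enc in codec_names:
        for dec in codec_names:
            try:
                repaired = garbled.encode(enc).decode(dec)
            except (UnicodeError, LookupError):
                # This pair cannot represent the text; skip it.
                continue
            if expected in repaired:
                matches.append((enc, dec))
    return matches


# "Ã¶" is what "ö" looks like when UTF-8 bytes are read as cp1252 or latin_1.
print(repair_candidates("Ã¶", "ö", ["cp1252", "latin_1", "utf_8"]))
# → [('cp1252', 'utf_8'), ('latin_1', 'utf_8')]
```

Passing all_codecs from the script above as codec_names runs the same exhaustive search, but reports only the pairs that produce the expected fragment.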

score 1 · Accepted Answer

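If you know some of the file's content, you can try to decode it with several encodings and see which one loses that content. In general there is no way, since a text file is a text file and those are stupid ;)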

score 1 · Accepted Answer

This site has python code for recognizing ascii, encodings with a BOM, and utf8 without a BOM: https://unicodebook.readthedocs.io/guess_encoding.html. Read the file into a byte array (data): http://www.codecodex.com/wiki/Read_a_file_into_a_byte_array. Here's an example. I'm on OS X.

#!/usr/bin/python                                                                                                  

import sys

def isUTF8(data):
    try:
        decoded = data.decode('UTF-8')
    except UnicodeDecodeError:
        return False
    else:
        for ch in decoded:
            if 0xD800 <= ord(ch) <= 0xDFFF:
                return False
        return True

def get_bytes_from_file(filename):
    return open(filename, "rb").read()

filename = sys.argv[1]
data = get_bytes_from_file(filename)
result = isUTF8(data)
print(result)


PS /Users/js> ./isutf8.py hi.txt                                                                                     
True

score 0 · Accepted Answer

Using the linux file -i command:

import re
import subprocess

file = "path/to/file/file.txt"

# Pass the arguments as a list so the path needs no shell quoting
output = subprocess.check_output(["file", "-bi", file]).decode()

# Output looks like "text/plain; charset=utf-8"; keep what follows "="
encoding = re.sub(r"[^a-z0-9\-]", "", output.split("=")[1], flags=re.IGNORECASE)

print(encoding)

score 0 · Accepted Answer

You can use the python-magic package, which does not load the whole file into memory:

import magic


def detect(
    file_path,
):
    return magic.Magic(
        mime_encoding=True,
    ).from_file(file_path)

The output is the encoding name, for example:

iso-8859-1
us-ascii
utf-8

score 0 · Accepted Answer

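You can use the chardet module:

import chardet

with open(filepath, "rb") as f:
    data = f.read()
    encode = chardet.UniversalDetector()
    # the detector has to be fed the bytes before it can report a result
    encode.feed(data)
    encode.close()
    print(encode.result)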

Or you can use the chardet3 command in linux, but it takes a little while:

chardet3 fileName

Example:

chardet3 donnee/dir/donnee.csv
donnee/dir/donnee.csv: ISO-8859-1 with confidence 0.73
