python - Python - Getting a file's encoding

Question

I'm having trouble getting the character encoding of a file. The code in question is here:

    rawdata = open(file, "r").read()
    encoding = chardet.detect(rawdata.encode())['encoding']
    #return encoding

(Code courtesy of Ashish Greycube: https://github.com/frappe/frappe/pull/8061)

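I copied part of the csv file I'm working with into a more manageable "test" file. When I run the code above on that file, it reports 'ascii'. That may be part of the problem. Basically, I've found that I need to know the encoding for this program.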

The error output is as follows:

Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 20, in get_file_encoding
    encoding = chardet.detect(rawdata.encode())['encoding']
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\__init__.py", line 38, in detect
    detector.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\universaldetector.py", line 211, in feed
    if prober.feed(byte_str) == ProbingState.FOUND_IT:
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetgroupprober.py", line 71, in feed
    state = prober.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\hebrewprober.py", line 227, in feed
    byte_str = self.filter_high_byte_only(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetprober.py", line 63, in filter_high_byte_only
    buf = re.sub(b'([\x00-\x7F])+', b' ', buf)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\re.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError
PS C:\Users\stsho\dev\csv_sanitizer_1.2> python .\program.py
Please enter filename: ANQAR
Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 19, in get_file_encoding
    rawdata = open(file, "r").read()
FileNotFoundError: [Errno 2] No such file or directory: 'ANQAR.csv'
PS C:\Users\stsho\dev\csv_sanitizer_1.2> python .\program.py
Please enter filename: ANQAR
Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 19, in get_file_encoding
    rawdata = open(file, "r").read()
FileNotFoundError: [Errno 2] No such file or directory: 'ANQAR.csv'
PS C:\Users\stsho\dev\csv_sanitizer_1.2> python .\program.py
Please enter filename: ANQAR
Traceback (most recent call last):
  File ".\program.py", line 26, in <module>
    my_encoding = get_file_encoding(data)
  File ".\program.py", line 20, in get_file_encoding
    encoding = chardet.detect(rawdata.encode())['encoding']
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\__init__.py", line 38, in detect
    detector.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\universaldetector.py", line 211, in feed
    if prober.feed(byte_str) == ProbingState.FOUND_IT:
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetgroupprober.py", line 71, in feed
    state = prober.feed(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\hebrewprober.py", line 227, in feed
    byte_str = self.filter_high_byte_only(byte_str)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\site-packages\chardet\charsetprober.py", line 63, in filter_high_byte_only
    buf = re.sub(b'([\x00-\x7F])+', b' ', buf)
  File "C:\Users\stsho\AppData\Local\Programs\Python\Python38-32\lib\re.py", line 208, in sub
    return _compile(pattern, flags).sub(repl, string, count)
MemoryError

score 1 · Accepted Answer

This works:

    import chardet

    # Open in binary mode so chardet sees the file's raw bytes, not decoded text
    with open(file, "rb") as f:
        rawdata = f.read()
    encoding = chardet.detect(rawdata)['encoding']
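
Reading the whole file into memory first is fine for small files, but it may not suit a large CSV. As a rough sketch only: chardet also provides an incremental `UniversalDetector`, which can be fed one chunk at a time. The function name `detect_encoding` and the 64 KiB `chunk_size` are my own choices for illustration, not something from the question, and I have not tried this on the asker's file.

```python
from chardet.universaldetector import UniversalDetector


def detect_encoding(path, chunk_size=64 * 1024):
    """Guess a file's encoding by feeding chardet one chunk at a time."""
    detector = UniversalDetector()
    # Binary mode: chardet needs the raw bytes, not decoded text.
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            detector.feed(chunk)
            if detector.done:  # chardet is already confident; stop early
                break
    detector.close()  # finalizes the guess from whatever was fed
    # May be None if chardet could not settle on an encoding.
    return detector.result["encoding"]
```

Something like `my_encoding = detect_encoding("ANQAR.csv")` would then replace the read-everything call. The result is still only a guess, so treat it with the same caution as the output of `chardet.detect`.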

score 1 · Accepted Answer

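A MemoryError usually means you are trying to load data that is too large for your memory, whether that is address space or available storage (RAM + swap/pagefile space). You appear to be running a 32-bit build of Python, which limits your address space to 2 GB; I'd suggest switching to a 64-bit build, since most machines these days have more than 4 GB of RAM, and not using a 64-bit build means you can't use most of it.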
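
If you are not sure which build you have, a quick check using only the standard library is sketched below. The `Python38-32` folder in the traceback paths already suggests a 32-bit install, but this confirms it from inside the interpreter. The variable names are just for illustration.

```python
import struct
import sys

# Size of a C pointer in bits: 32 on a 32-bit build, 64 on a 64-bit build.
pointer_bits = struct.calcsize("P") * 8

# Cross-check: sys.maxsize is 2**31 - 1 on 32-bit builds, 2**63 - 1 on 64-bit ones.
is_64bit = sys.maxsize > 2**32

print(f"This interpreter is a {pointer_bits}-bit build (64-bit: {is_64bit})")
```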

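Additional issue: when you read the file in text mode, you are already assuming you know its encoding. Don't do that. Open the file in binary mode ("rb") and hand the raw bytes to chardet.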
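
To illustrate why text mode is the wrong tool here, below is a small self-contained sketch using a made-up Latin-1 sample (not the asker's data). Binary mode hands back the stored bytes untouched, while text mode has to pick an encoding up front and fails when the guess is wrong. Note also that calling `.encode()` on already-decoded text re-encodes it as UTF-8 by default, so chardet would be looking at Python's output rather than at the file's original bytes.

```python
import os
import tempfile

# Hypothetical sample: "café" encoded as Latin-1, so "é" is the single byte 0xE9.
raw = "café\n".encode("latin-1")

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(raw)
path = tmp.name

# Binary mode returns the bytes exactly as stored; no encoding is assumed.
with open(path, "rb") as f:
    data = f.read()

# Text mode must decode, so it commits to an encoding before you know the real one.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False  # 0xE9 followed by a newline is not valid UTF-8

os.remove(path)
```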

score -1 · Accepted Answer

As @ShadowRanger said, try a 64-bit build and don't read the file in text mode. Try this:

    # Binary mode returns bytes, so pass them to chardet directly (bytes has no .encode())
    rawdata = open(file, "rb").read()
    encoding = chardet.detect(rawdata)['encoding']

Also make sure that your file exists and that you typed its name correctly.
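
For the FileNotFoundError runs in the question, a check along these lines might help. It is only a sketch: `resolve_csv` is a hypothetical helper, and it assumes (judging by the traceback, where entering `ANQAR` leads to a lookup of `ANQAR.csv`) that the program appends `.csv` to whatever name was typed.

```python
from pathlib import Path


def resolve_csv(name, folder="."):
    """Return the path to <name>.csv inside folder, or raise if it is missing."""
    path = Path(folder) / name
    # Append ".csv" only if the user did not type it already.
    if path.suffix.lower() != ".csv":
        path = path.with_name(path.name + ".csv")
    if not path.is_file():
        # Show the absolute path so a wrong working directory is easy to spot.
        raise FileNotFoundError(f"No such file: {path.resolve()}")
    return path
```

Printing the absolute path in the error makes it obvious when the script is being run from a different directory than the one holding the file.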
