python - Locating fragments of non-Latin-1 text in a mostly Latin-1 file?

Question

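I believe an English .txt file is Latin-1, but it may contain fragments in another encoding. Is there a library or tool for locating these fragments?

I am aware of things like the Python chardet library, but I am specifically looking for a tool that tests a Latin-1 file and detects anomalies. Even a general detection library would do, if it can tell me the point at which it detected a non-Latin-1 pattern and give me the index.

Command-line tools and Python libraries are especially welcome.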

score 0 · Accepted Answer

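Latin-1 (or do you mean ISO-8859-15, also known as Latin-9, its variant with the euro sign?) is not that easy to detect.

A simple approach might be to check whether any of the unused characters are actually used (see the table here); if so, something is wrong. However, detecting subtler violations requires actually checking whether the text is in one of the languages that use Latin-1. Otherwise, there is no way to tell 8-bit encodings apart. It is better not to mix 8-bit encodings in the first place without somehow marking the change of encoding...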
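The simple check could be sketched roughly as follows. This assumes that the "unused characters" are the bytes 0x80 to 0x9F, which ISO-8859-1 assigns to C1 control codes that ordinary text almost never contains; the table referred to above is not reproduced here, so that range is an assumption. The helper name `find_suspicious_bytes` is made up for illustration.

```python
import re

# Assumption: bytes 0x80-0x9F are treated as "unused" in Latin-1 text.
# ISO-8859-1 maps them to C1 control codes; cp1252 uses most of the same
# range for printable characters such as curly quotes and dashes.
SUSPICIOUS = re.compile(rb'[\x80-\x9f]+')

def find_suspicious_bytes(data):
    """Return (offset, run) pairs for runs of bytes unusual in Latin-1 text."""
    return [(m.start(), m.group()) for m in SUSPICIOUS.finditer(data)]

# Example: a Latin-1 e-acute (0xE9) passes, cp1252 curly quotes are flagged.
sample = b'caf\xe9 \x93quoted\x94 text'
for offset, run in find_suspicious_bytes(sample):
    print(offset, run)
```

A hit only shows that the file is not clean Latin-1 at that offset; it does not say which encoding the fragment is actually in.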

score 0 · Accepted Answer

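What are your grounds for believing that the file (1) is Latin-1 and (2) may contain fragments in another encoding? How big is the file? What is a "general detection library"? Have you considered the possibility that it is a Windows encoding such as cp1252?

Some rough diagnostics: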

# preliminaries (Python 3)
import re

with open('the_file.txt', 'rb') as f:
    text = f.read()
print(len(text), "bytes in file")

# How many non-ASCII bytes? (iterating over bytes yields ints in Python 3)
print(sum(1 for c in text if c > 0x7f), "non-ASCII bytes")

# Will it decode as UTF-8 OK?
try:
    junk = text.decode('utf8')
    print("utf8 decode OK")
except UnicodeDecodeError as e:
    print(e)

# Runs of more than one non-ASCII byte are somewhat rare in single-byte encodings
# of languages written in a Latin script ...
runs = re.findall(rb'[\x80-\xff]+', text)
nruns = len(runs)
print(nruns, "runs of non-ASCII bytes")
if nruns:
    avg_rlen = sum(len(run) for run in runs) / nruns
    print("average run length: %.2f bytes" % avg_rlen)
# then if indicated you could write some code to display runs in context ...
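
Displaying the runs in context might look something like the sketch below. The function name `runs_in_context`, the 20-byte context window, and the minimum run length of 2 are arbitrary choices made for illustration, not part of the diagnostics above. It also records whether each run happens to be valid UTF-8 on its own, which is only a hint that the fragment may be UTF-8, not proof.

```python
import re

def runs_in_context(data, context=20, min_len=2):
    """Return (offset, run, is_utf8, snippet) for each long non-ASCII run.

    data is a bytes object; runs shorter than min_len bytes are skipped,
    since isolated non-ASCII bytes are normal in Latin-1 text.
    """
    results = []
    for m in re.finditer(rb'[\x80-\xff]+', data):
        run = m.group()
        if len(run) < min_len:
            continue
        try:
            run.decode('utf-8')
            is_utf8 = True
        except UnicodeDecodeError:
            is_utf8 = False
        start = max(0, m.start() - context)
        end = min(len(data), m.end() + context)
        # Decode as Latin-1 for display only: every byte maps to a character,
        # so this never raises, but the snippet may look like mojibake.
        snippet = data[start:end].decode('latin-1')
        results.append((m.start(), run, is_utf8, snippet))
    return results

# Example: two single Latin-1 bytes are ignored, a UTF-8 euro sign is reported.
sample = b'na\xefve caf\xe9 and \xe2\x82\xac5'
for offset, run, is_utf8, snippet in runs_in_context(sample):
    print(offset, run, is_utf8, repr(snippet))
```

The offsets are byte indexes into the file, so they can be passed directly to `seek` or to a hex viewer.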
