python - How can Python check whether a file name is UTF8?

Question

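I have a PHP script that builds a list of the files in a directory. However, PHP only sees file names written in English and completely ignores file names in other languages, such as Russian or Asian languages.

After a lot of effort, I found the only solution that worked for me: using a Python script to rename the files to UTF8, so that the PHP script can process them afterwards.

(After PHP has finished processing the files, I rename them to English; I do not keep them in UTF8.)

I used the following Python script, and it works well: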

import sys
import os
import glob
import ntpath
from random import randint

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      infile_utf8 = infile.encode('utf8')
      os.rename(infile, infile_utf8)

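The problem is that it also converts file names that are already in UTF8. I need a way to skip the conversion if the file name is already UTF8.

I am trying this Python script: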

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      try:
        infile.decode('UTF-8', 'strict')
      except UnicodeDecodeError:
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)

However, if the file name is already in UTF8, I get a fatal error:

UnicodeDecodeError: 'ascii' codec can't decode characters in position 18-20
ordinal not in range(128)

I also tried another approach, which did not work either:

for infile in glob.glob( os.path.join('C:\\MyFiles', u'*') ):
    if os.path.isfile(infile):
      try:
        tmpstr = str(infile)
      except UnicodeDecodeError:
        infile_utf8 = infile.encode('utf8')
        os.rename(infile, infile_utf8)

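I get exactly the same error as before.

Any ideas?

Python is very new to me, and debugging even a simple script is a huge effort for me, so please write an explicit answer (i.e. code). I am not able to test general ideas that may or may not work. Thanks.

Example file names: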

 hello.txt
 你好.txt
 안녕하세요.html
 chào.doc

score 3 · Accepted Answer

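For all UTF-8 questions in Python, I strongly recommend spending 36 minutes watching "Pragmatic Unicode" by Ned Batchelder ( http://nedbatchelder.com/text/unipain.html ), presented at PyCon 2012. For me it was a revelation! Much of the material in this presentation is not actually specific to Python, but it helps in understanding important things such as the difference between Unicode strings and UTF-8 encoded bytes...

The reason I recommend this video to you (as I have to many friends) is that some of your code contains contradictions, such as trying to decode and then encode if the decoding fails: these two methods cannot apply to the same object! Although this is syntactically possible in Python 2, it makes no sense, and in Python 3 the distinction between bytes and str makes things clearer:

A str object can be encoded into bytes: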

>>> a = 'a'
>>> type(a)
<class 'str'>
>>> a.encode
<built-in method encode of str object at 0x7f1f6b842c00>
>>> a.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'

...while a bytes object can be decoded into str:

>>> b = b'b'
>>> type(b)
<class 'bytes'>
>>> b.decode
<built-in method decode of bytes object at 0x7f1f6b79ddc8>
>>> b.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'

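Coming back to your file name problem, the tricky question you need to answer is: "what is the encoding of the file names?" The language does not matter, only the encoding!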
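As a minimal sketch of what "checking for UTF-8" can mean, assuming Python 3 (the code in the question is Python 2): only a bytes object can be tested, by attempting a strict decode. The helper name is_valid_utf8 is hypothetical and not part of the original answer. Keep in mind that plain ASCII bytes are also valid UTF-8, and that a successful decode does not prove the bytes were meant to be UTF-8.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8", "strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_name = "你好.txt".encode("utf-8")      # bytes produced by UTF-8 encoding
cp1252_name = "chào.doc".encode("cp1252")   # similar name, different encoding

print(is_valid_utf8(utf8_name))
print(is_valid_utf8(cp1252_name))
```

A str object has no encoding to check; the question only makes sense for the raw bytes of a name.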

score 3 · Accepted Answer

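I think you are mixing up your terminology and making some wrong assumptions. AFAIK, PHP can open file names in any encoding; PHP is very agnostic about encodings.

It is also unclear what you want to achieve, since UTF-8 != English, and the example foreign file names could be encoded in several ways, but never in ASCII English! Can you explain what you think an existing UTF-8 file looks like, and what a non-UTF-8 file is?

To add to your confusion, under Windows file names are transparently stored as UTF-16. Therefore, you should not try to encode file names to UTF-8. Instead, you should use Unicode strings and let Python work out the proper conversion. (Do not encode to UTF-16 either!)

Please clarify your question further.

Update:

I now understand your problem with PHP. http://evertpot.com/filesystem-encoding-and-php/ tells us that non-Latin characters are troublesome in PHP+Windows. It seems that only files whose names consist of Windows 1252 character set characters can be seen and opened.

Your challenge is to convert your file names to be Windows 1252 compatible. As you stated in your question, it would be ideal not to rename files that are already compatible. I have reworked your attempt as: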
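A small sketch of "use Unicode strings and let Python do the conversion", assuming Python 3, where passing a str path to os.listdir returns names that are already decoded str objects. The helper non_ascii_names and the temporary directory are hypothetical, used only to keep the example self-contained.

```python
import os
import tempfile


def non_ascii_names(directory: str) -> list:
    """Return the names in directory that contain non-ASCII characters."""
    # A str path makes os.listdir return str names; no manual decoding needed.
    return sorted(name for name in os.listdir(directory) if not name.isascii())


with tempfile.TemporaryDirectory() as tmp:
    # Create two sample files so the example does not depend on existing data.
    for name in ("hello.txt", "你好.txt"):
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write("sample")
    found = non_ascii_names(tmp)
    print(found)
```

At no point does the code call encode or decode on a file name; the conversion to the platform's native form is left to Python.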

import os
from glob import glob
import shutil
import urllib

files = glob(u'*.txt')
for my_file in files:
    try:
        print "File %s" % my_file
    except UnicodeEncodeError:
        print "File (escaped): %s" % my_file.encode("unicode_escape")
    new_name = my_file
    try:
        my_file.encode("cp1252" , "strict")
        print "    Name unchanged. Copying anyway"
    except UnicodeEncodeError:
        print "    Can not convert to cp1252"
        utf_8_name = my_file.encode("UTF-8")
        new_name = urllib.quote(utf_8_name )
        print "    New name: (%% encoded): %s" % new_name
    
    shutil.copy2(my_file, os.path.join("fixed", new_name))

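Breakdown:

Print the file name. By default, the Windows shell only shows results in the local DOS code page. For example, my shell can display ü.txt, but €.txt shows up as ?.txt. So you need to be careful that Python does not throw an exception because it cannot print properly. This code attempts to print the Unicode version, but falls back to printing Unicode code point escapes instead.
Try to encode the string as Windows-1252. If that works, the file name is OK.
Otherwise: convert the file name to UTF-8, then percent-encode it. This way the file name stays unique, and you can reverse the process in PHP.
Copy the file to the new/verified files.

For example, 你好.txt becomes %E4%BD%A0%E5%A5%BD.txt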
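The cp1252 check and the percent-encoding step can be sketched in Python 3 as follows (the answer's script is Python 2, where quote lives in urllib rather than urllib.parse). The helper name cp1252_safe_name is hypothetical; this is an illustration under those assumptions, not a replacement for the script above.

```python
from urllib.parse import quote, unquote


def cp1252_safe_name(name: str) -> str:
    """Return name unchanged if it fits in cp1252, else percent-encode its UTF-8 bytes."""
    try:
        name.encode("cp1252", "strict")
    except UnicodeEncodeError:
        # Not representable in cp1252: escape the UTF-8 bytes as %XX sequences.
        return quote(name.encode("utf-8"))
    return name


for original in ("hello.txt", "chào.doc", "你好.txt"):
    safe = cp1252_safe_name(original)
    # unquote decodes percent escapes as UTF-8 by default, reversing the step.
    print(original, "->", safe, "->", unquote(safe))
```

One caveat with this sketch: a name that fits in cp1252 but already contains a literal % is left unescaped, so reversing it with unquote could alter such a name.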
