encoding - Can't open files through a Python CGI script after renaming them from English to Greek characters

Question

On Linux, the file system stores bytes, and only bytes.

So if a program decides it should send file names encoded as, say, UTF-16 or ISO-8859-7, it takes a string like "Νικόλαος" and the file system ends up seeing bytes like these:

py> s = 'Νικόλαος' 
py> s.encode('UTF-16be') 
b'\x03\x9d\x03\xb9\x03\xba\x03\xcc\x03\xbb\x03\xb1\x03\xbf\x03\xc2' 

py> s.encode('iso-8859-7') 
b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'

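Note that the same string gives you completely different bytes. Likewise, the same bytes give you different strings depending on which encoding you use.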
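To make the second half of that point concrete, here is a minimal sketch that decodes the ISO-8859-7 bytes shown above twice: once with the encoding that produced them and once with Latin-1, which is only a wrong guess picked for illustration. The variable names are mine.

```python
# The eight bytes that ISO-8859-7 produced for the Greek name above.
raw = b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'

as_greek = raw.decode('iso-8859-7')   # the encoding that really produced the bytes
as_latin1 = raw.decode('latin-1')     # a wrong guess; Latin-1 accepts any byte, so it never fails

# ascii() keeps the output printable even when stdout is not UTF-8,
# which can easily be the case under CGI.
print(ascii(as_greek))
print(ascii(as_latin1))
print(as_greek == as_latin1)
```

Both calls succeed, but only one of them gives back the original name. A decode that does not raise is not proof that the right charset was chosen.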

Now, if you try to read that file name with a program that expects UTF-8, it will either see some kind of mojibake garbage or fail with an error:

py> s.encode('UTF-16be').decode('utf-8') 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 1: 
invalid start byte 

py> s.encode('iso-8859-7').decode('utf-8') 
Traceback (most recent call last): 
  File "<stdin>", line 1, in <module> 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcd in position 0: 
invalid continuation byte
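Whether you get garbage or an exception depends on the error handler passed to decode(). The sketch below, using the same ISO-8859-7 bytes, contrasts errors='replace' with errors='surrogateescape'. The latter is the handler Python 3 itself applies to file names on POSIX systems (PEP 383), so undecodable names can still round-trip back to their original bytes. The variable names are only for illustration.

```python
raw = b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'   # the Greek name in ISO-8859-7

# 'replace' never raises, but bad bytes turn into U+FFFD and the name is lost.
lossy = raw.decode('utf-8', errors='replace')

# 'surrogateescape' maps each undecodable byte to a lone surrogate code point,
# so encoding with the same handler gives the original bytes back.
escaped = raw.decode('utf-8', errors='surrogateescape')
restored = escaped.encode('utf-8', errors='surrogateescape')

print(ascii(lossy))        # ascii() so this prints on any stdout encoding
print(ascii(escaped))      # printing escaped directly could raise on a UTF-8 stdout
print(restored == raw)
```

A string that contains such surrogates is fine to hand back to os functions, but it cannot be written to a UTF-8 stream or sent to a database as-is, which would explain failures that show up only later in a script.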

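I have some files in that directory whose names, as bytes, are not valid when Python 3 decodes them as UTF-8, yet my system is set up to use UTF-8. So to fix this I need to rename the files with some tool that doesn't care much about encodings.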

So I tried:

mv 'EUxi tou Ihsou.mp3' 'Ευχή του Ιησού.mp3'

string = 'Ευχή του Ιησού.mp3' 
the same string as bytes in an unknown charset = '\305\365\367\336\364\357\365\ \311\347\363\357\375.mp3'
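The octal escapes above start with \305\365\367\336, i.e. 0xC5 0xF5 0xF7 0xDE, which is what ISO-8859-7 produces for 'Ευχή' (Windows-1253 agrees on these letters). That is an inference from the bytes, not something the system reports. Under that assumption, a byte-level rename could be sketched as follows. rename_to_utf8 and source_encoding are hypothetical names, and the default charset is the guess just described.

```python
import os

def rename_to_utf8(directory, source_encoding='iso-8859-7'):
    """Rename entries whose names are not valid UTF-8; return (old, new) byte pairs."""
    renamed = []
    directory = os.fsencode(directory)      # bytes path in, bytes names out
    for name in os.listdir(directory):
        try:
            name.decode('utf-8')
            continue                        # already valid UTF-8, leave it alone
        except UnicodeDecodeError:
            pass
        # Assumption: every invalid name was written in source_encoding.
        # A byte that charset does not define raises UnicodeDecodeError here.
        new_name = name.decode(source_encoding).encode('utf-8')
        target = os.path.join(directory, new_name)
        if os.path.lexists(target):
            continue                        # never overwrite an existing entry
        os.rename(os.path.join(directory, name), target)
        renamed.append((name, new_name))
    return renamed
```

The function only looks at one directory level and works on bytes throughout, so Python never has to decode the broken names with the system encoding. Trying it on a copy of the directory first would be prudent.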

So how should the following code be written so that files.py reads the Greek file names correctly?

import os

# Compute a set of current fullpaths
fullpaths = set()
path = "/home/nikos/public_html/data/apps/"

for root, dirs, files in os.walk(path):
    for filename in files:
        fullpaths.add(os.path.join(root, filename))

# Load'em
for fullpath in fullpaths:
    try:
        # Check the presence of a file against the database and insert if it doesn't exist
        cur.execute('''SELECT url FROM files WHERE url = %s''', (fullpath,))
        data = cur.fetchone()
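One possible direction, offered as a sketch and not a confirmed fix: pass a bytes path to os.walk so it returns raw byte names, keep those bytes for opening the files, and decode each path component separately for display or database use. decode_name, collect_fullpaths and the ISO-8859-7 fallback are all assumptions of this sketch.

```python
import os

def decode_name(raw, fallback='iso-8859-7'):
    """Decode one path component as UTF-8, else with an assumed legacy charset."""
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        # The fallback charset is a guess; bytes it does not define become U+FFFD.
        return raw.decode(fallback, errors='replace')

def collect_fullpaths(top):
    """Map each raw byte path under top to a displayable str."""
    sep = os.fsencode(os.sep)
    paths = {}
    for root, dirs, files in os.walk(os.fsencode(top)):   # bytes in, bytes out
        for name in files:
            raw_path = os.path.join(root, name)
            # Decode component by component so a UTF-8 directory name is not
            # garbled by the fallback needed for a legacy file name.
            shown = os.sep.join(decode_name(part) for part in raw_path.split(sep))
            paths[raw_path] = shown
    return paths
```

The dictionary keys are what open() or os.stat() should receive, while the values are safe to print or to pass as the query parameter in place of fullpath.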

Encoding = string -> (some charset used) -> charset bytes
Decoding = bytes -> (must know which charset was used) -> original string

Do I understand correctly that the key to the whole encoding/decoding process is the charset that is used, or must be used?

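We don't know which key (charset) they used, but we do know the original form of the string, so it occurred to me that if we write a Python script that decodes the mojibake byte stream with every available charset, the original string will show up again.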
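That idea can be sketched in a few lines. The version below tries a short, hand-picked list of candidate codecs, not every charset Python knows, and reports which ones turn the bytes back into the expected string. matching_charsets and CANDIDATES are hypothetical names, and the list is just a plausible selection for Greek text.

```python
def matching_charsets(raw, expected, candidates):
    """Return the candidate codec names under which raw decodes to expected."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except (UnicodeError, LookupError):
            continue      # bytes invalid for this codec, or codec not available
        if decoded == expected:
            matches.append(name)
    return matches

# A small, assumed selection; extend it with whatever charsets seem plausible.
CANDIDATES = ['utf-8', 'utf-16-be', 'iso-8859-7', 'cp1253', 'cp737', 'latin-1']

raw = b'\xcd\xe9\xea\xfc\xeb\xe1\xef\xf2'
expected = 'Νικόλαος'
print(matching_charsets(raw, expected, CANDIDATES))
```

More than one codec can reproduce the same string, since related charsets often agree on the letters that happen to occur in one name. The result narrows the search down; it does not prove which charset the original program used.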
