python - How do I get the correct case of an existing filename in Python on OSX with HFS+?

Question

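I am storing data about files that exist on an OSX HFS+ filesystem. Later I want to iterate over the stored data and determine whether each file still exists. For my purposes the case of the filename matters, so if the case of a filename has changed, I consider the file to no longer exist.

I started by trying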

os.path.isfile(filename)

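but on a normal OSX install on HFS+ it returns True even when the case of the filename does not match. I am looking for a way to write an isfile() function that cares about case even though the filesystem does not.

os.path.normcase() and os.path.realpath() both return the filename in whatever case I pass in.

Edit:

I now have two functions that seem to work for filenames restricted to ASCII. I don't know how Unicode or other characters would affect this.

The first is based on the answers given by omz and Alex L.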

def does_file_exist_case_sensitive1a(fname):
    if not os.path.isfile(fname): return False
    path, filename = os.path.split(fname)
    search_path = '.' if path == '' else path
    for name in os.listdir(search_path):
        if name == filename : return True
    return False

The second is probably even less efficient.

import glob, os, re

def does_file_exist_case_sensitive2(fname):
    if not os.path.isfile(fname): return False
    # find the last ASCII letter in the name
    m = re.search(r'[a-zA-Z][^a-zA-Z]*\Z', fname)
    if m:
        # replace exactly that letter with a wildcard so glob has to list the directory
        test = fname[:m.start()] + '?' + fname[m.start() + 1:]
        actual = glob.glob(test)
        return len(actual) == 1 and actual[0] == fname
    else:
        return True  # no letters in file, case sensitivity doesn't matter

Here is a third, based on DSM's answer.

def does_file_exist_case_sensitive3(fname):
    if not os.path.isfile(fname): return False
    path, filename = os.path.split(fname)
    search_path = '.' if path == '' else path
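    # os.listdir() returns bare names, so join them with the directory before calling os.stat()
    inodes = {os.stat(os.path.join(search_path, x)).st_ino: x for x in os.listdir(search_path)}
    return inodes.get(os.stat(fname).st_ino) == filename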

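I don't expect any of these to perform well if I have thousands of files in one directory. I am still hoping for something that feels more efficient.

Another drawback I noticed while testing is that they only check whether the filename matches. If I pass them a path that includes directory names, none of these functions so far check the case of the directory names.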

score 6 · Accepted Answer

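This answer complements the existing ones by providing functions, adapted from Alex L's answer, that:

also work with non-ASCII characters
handle all path components (not just the last one)
work with Python 2.x and 3.x
as a bonus, also work on Windows (there are better Windows-specific solutions, see https://stackoverflow.com/a/2114975/45375, but the functions here are cross-platform and require no additional packages)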

import os, sys, unicodedata

def gettruecasepath(path): # IMPORTANT: <path> must be a Unicode string
  if not os.path.lexists(path): # use lexists to also find broken symlinks
    raise OSError(2, u'No such file or directory', path)
  isosx = sys.platform == u'darwin'
  if isosx: # convert to NFD for comparison with os.listdir() results
    path = unicodedata.normalize('NFD', path)
  parentpath, leaf = os.path.split(path)
  # find true case of leaf component
  if leaf not in [ u'.', u'..' ]: # skip . and .. components
    leaf_lower = leaf.lower() # if you use Py3.3+: change .lower() to .casefold()
    found = False
    for leaf in os.listdir(u'.' if parentpath == u'' else parentpath):
      if leaf_lower == leaf.lower(): # see .casefold() comment above
          found = True
          if isosx:
            leaf = unicodedata.normalize('NFC', leaf) # convert to NFC for return value
          break
    if not found:
      # should only happen if the path was just deleted
      raise OSError(2, u'Unexpectedly not found in ' + parentpath, leaf_lower)
  # recurse on parent path
  if parentpath not in [ u'', u'.', u'..', u'/', u'\\' ] and \
                not (sys.platform == u'win32' and 
                     os.path.splitdrive(parentpath)[1] in [ u'\\', u'/' ]):
      parentpath = gettruecasepath(parentpath) # recurse
  return os.path.join(parentpath, leaf)


def istruecasepath(path): # IMPORTANT: <path> must be a Unicode string
  return gettruecasepath(path) == unicodedata.normalize('NFC', path)

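gettruecasepath() returns the case-exact representation, as stored in the filesystem, of the specified path (absolute or relative), if it exists:
- The input path must be a Unicode string:
  - Python 3.x: strings are natively Unicode, so no extra action is needed.
  - Python 2.x: literals: prefix with u, e.g. u'Motörhead'; str variables: convert with, e.g., strVar.decode('utf8')
- The returned string is a Unicode string in NFC (composed normal form). NFC is returned even on OSX, where the filesystem (HFS+) stores names in NFD (decomposed normal form).
  NFC is returned because it is far more common than NFD, and Python does not recognize equivalent NFC and NFD strings as (conceptually) identical. See below for background information.
- The returned path preserves the structure of the input path (relative vs. absolute, components such as . and ..), except that multiple path separators are collapsed and, on Windows, the returned path always uses \ as the path separator.
- On Windows, a drive / UNC-share component, if present, is retained as-is.
- An OSError exception is raised if the path does not exist or if you do not have permission to access it.
- If you use this function on a case-sensitive filesystem, e.g. on Linux with ext4, it effectively degrades to indicating whether the input path exists in the exact case specified.
istruecasepath() uses gettruecasepath() to compare the input path with the path as stored in the filesystem.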

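Caveat: Because these functions need to examine all directory entries at every level of the input path (as specified), they will be slow, and unpredictably so, since performance depends on how many items the examined directories contain. Read on for background information.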
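
If many stored paths share the same directories, one way to soften this cost is to cache each directory listing. The following is only a minimal Python 3 sketch of that idea, not part of the functions above: the names _listing and exists_with_exact_case are made up for illustration, and the cache will go stale if the directories change while the program runs (call _listing.cache_clear() to reset it).

```python
import functools
import os
import unicodedata


@functools.lru_cache(maxsize=None)
def _listing(directory):
    """Return the NFC-normalized entry names of `directory` (cached per directory)."""
    return frozenset(unicodedata.normalize("NFC", name)
                     for name in os.listdir(directory))


def exists_with_exact_case(path):
    """Return True if every component of `path` exists with exactly this case.

    Hypothetical helper: each directory is listed at most once per cache
    lifetime, so checking many files in one directory costs one os.listdir().
    """
    path = unicodedata.normalize("NFC", os.path.normpath(path))
    while True:
        parent, leaf = os.path.split(path)
        if not leaf:  # reached the root or ran out of relative components
            return True
        if leaf not in (os.curdir, os.pardir):
            try:
                if leaf not in _listing(parent or os.curdir):
                    return False
            except OSError:  # parent is missing, unreadable, or not a directory
                return False
        path = parent
```

Unlike gettruecasepath(), this sketch only answers yes or no and does not return the path as stored on disk.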

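Background

Native API support (lack thereof)

It is curious that neither OSX nor Windows provides a native API method that directly solves this problem.

While on Windows you can cleverly combine two API methods to solve the problem, on OSX I know of no alternative to the approach used above: enumerating the directory contents at every level of the path examined, which is unpredictably slow.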

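Unicode normal forms: NFC vs. NFD

HFS+ (OSX's filesystem) stores filenames in decomposed Unicode form (NFD), which causes problems when comparing such names to in-memory Unicode strings in most programming languages, which are usually in composed Unicode form (NFC).

For example, a path containing the non-ASCII character ü that you specify as a literal in your source code will be represented as a single Unicode code point, U+00FC; this is an example of NFC: the 'C' stands for composed, because the base letter u and its diacritic ¨ (a combining diaeresis) form a single letter.

By contrast, if you use ü as part of an HFS+ filename, it is translated to NFD form, which results in 2 Unicode code points: the base letter u (U+0075), followed by the combining diaeresis ( ̈, U+0308) as a separate code point; the 'D' stands for decomposed, because the character is decomposed into the base letter and its associated diacritic.

Even though the Unicode standard considers these 2 representations (canonically) equivalent, most programming languages, including Python, do not recognize this equivalence.
In the case of Python, you must use unicodedata.normalize() to convert both strings to the same form before comparing them.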
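
A small illustration of that point, written for Python 3; the helper name same_name is made up for this example:

```python
import unicodedata

nfc = "\u00fc"   # ü as a single precomposed code point (NFC)
nfd = "u\u0308"  # u followed by COMBINING DIAERESIS (NFD)


def same_name(a, b):
    """Compare two names after bringing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(nfc == nfd)           # False: plain comparison sees different code points
print(same_name(nfc, nfd))  # True: equal once both are normalized
```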

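(Side note: Unicode normal forms are separate from Unicode encodings, although the differing number of Unicode code points typically also affects the number of bytes needed to encode each form. In the example above, the single-code-point ü (NFC) requires 2 bytes to encode in UTF-8 (U+00FC -> 0xC3 0xBC), whereas the two-code-point ü (NFD) requires 3 bytes (U+0075 -> 0x75 and U+0308 -> 0xCC 0x88).)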
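
The byte counts from the side note can be checked directly in Python 3; this is just a quick verification sketch:

```python
nfc_bytes = "\u00fc".encode("utf-8")   # precomposed ü
nfd_bytes = "u\u0308".encode("utf-8")  # u + combining diaeresis

print(len(nfc_bytes), nfc_bytes)  # 2 bytes: 0xC3 0xBC
print(len(nfd_bytes), nfd_bytes)  # 3 bytes: 0x75 0xCC 0x88
```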

score 5 · Accepted Answer

Following on from omz's post, something like this might work:

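import os

def getcase(filepath):
    path, filename = os.path.split(filepath)
    # list the current directory when filepath has no directory part
    for fname in os.listdir(path or '.'):
        if filename.lower() == fname.lower():
            return os.path.join(path, fname)
    return None  # no entry matches, even ignoring case

print(getcase('/usr/myfile.txt'))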

score 4 · Accepted Answer

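Here is a crazy idea of mine. Disclaimer: I don't know nearly enough about filesystems to consider edge cases, so treat this only as something that happened to work. Once.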

>>> !ls
A.txt   b.txt
>>> inodes = {os.stat(x).st_ino: x for x in os.listdir(".")}
>>> inodes
{80827580: 'A.txt', 80827581: 'b.txt'}
>>> inodes[os.stat("A.txt").st_ino]
'A.txt'
>>> inodes[os.stat("a.txt").st_ino]
'A.txt'
>>> inodes[os.stat("B.txt").st_ino]
'b.txt'
>>> inodes[os.stat("b.txt").st_ino]
'b.txt'

score 2 · Accepted Answer


You could use something like os.listdir and check whether the list contains the filename you are looking for.

answered 2013-01-25T04:14:08.507

score 1 · Accepted Answer

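This answer is only a proof of concept, since it does not attempt to escape special characters, handle non-ASCII characters, or deal with filesystem encoding issues.

On the plus side, the answer does not involve looping over files in Python, and it correctly checks the directory names leading up to the final path segment.

The suggestion is based on the observation that (at least when using bash) the following command finds the path /my/path without error if and only if /my/path exists with that exact case.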

$ ls /[m]y/[p]ath

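(If the brackets are left off any part of the path, then that part will not be sensitive to changes in case.)

Here is an example function based on this idea: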

import os.path
import subprocess

def does_exist(path):
    """Return whether the given path exists with the given casing.

    The given path should begin with a slash and not end with a trailing
    slash.  This function does not attempt to escape special characters
    and does not attempt to handle non-ASCII characters, file system
    encodings, etc.
    """
    parts = []
    while True:
        head, tail = os.path.split(path)
        if tail:
            parts.append(tail)
            path = head
        else:
            assert head == '/'
            break
    parts.reverse()
    # For example, for path "/my/path", pattern is "/[m]y/[p]ath".
    pattern = "/" + "/".join(["[%s]%s" % (p[0], p[1:]) for p in parts])
    cmd = "ls %s" % pattern
    return_code = subprocess.call(cmd, shell=True)
    return not return_code
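
The same bracket trick can probably be done without spawning a shell by using Python's own glob module, which matches bracketed components against the directory listing. This is a rough Python 3 sketch under a few assumptions: POSIX-style paths, names for which only the first letter of each component needs bracketing, and the made-up helper names bracket_first_letter and exists_with_case_glob. It uses glob.escape() so other special characters are taken literally, but it still makes no attempt to handle NFC/NFD differences.

```python
import glob
import os


def bracket_first_letter(component):
    """Turn 'path' into '[p]ath' so glob must match it against the listing."""
    for i, ch in enumerate(component):
        if ch.isalpha():
            return (glob.escape(component[:i]) + "[" + ch + "]"
                    + glob.escape(component[i + 1:]))
    # no letters at all: case cannot differ, so just escape the component
    return glob.escape(component)


def exists_with_case_glob(path):
    """Return True if `path` exists with the given case (hypothetical helper)."""
    parts = path.split(os.sep)
    # e.g. "/my/path" becomes "/[m]y/[p]ath"
    pattern = os.sep.join(bracket_first_letter(p) for p in parts)
    return len(glob.glob(pattern)) > 0
```

Only letters are bracketed because case differences can only occur in letters, and because a bracketed "!" would be read as a negation by the pattern matcher.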

score -2 · Accepted Answer

You could also try opening the file.

    try:
        open('test', 'r').close()
    except IOError:
        print('File does not exist')
