python - Reading metadata with Python

Question

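For the past two days I have been scouring the internet for a solution to my problem. I have a folder of assorted files that run the gamut of file types. I am trying to write a Python script that reads the metadata from each file, if it exists. The goal is eventually to write that data to a file so it can be compared with the metadata extraction of another program.

I have found some examples, and they worked for a very small number of the files in the directory. Every approach I found involves opening a storage container object. I am new to Python and not sure what a storage container object is. I only know that most of my files raise an error when I try to use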

pythoncom.StgOpenStorage(<File Name>, None, flags)

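With the few files that actually work, I can get the main metadata tags, such as Title, Subject, Author, Created, and so on.

Does anyone know a way to get the metadata other than through storage containers? Also, if there is an easier way to do this in another language, by all means suggest it.

Thanks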

score 13 · Accepted Answer

You can use the Shell COM objects to retrieve any metadata that is visible in Explorer:

import win32com.client
sh=win32com.client.gencache.EnsureDispatch('Shell.Application',0)
ns = sh.NameSpace(r'm:\music\Aerosmith\Classics Live!')
colnum = 0
columns = []
while True:
    colname=ns.GetDetailsOf(None, colnum)
    if not colname:
        break
    columns.append(colname)
    colnum += 1

for item in ns.Items():
    print(item.Path)
    for colnum in range(len(columns)):
        colval=ns.GetDetailsOf(item, colnum)
        if colval:
            print('\t', columns[colnum], colval)
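
The question's end goal is to write the metadata to a file for comparison with another tool. A minimal sketch of that step, assuming the values have already been collected into a dictionary of {path: {column name: value}}; the helper name write_metadata_csv and the sample data are hypothetical:

```python
import csv
import io


def write_metadata_csv(rows, out):
    """Write {path: {column_name: value}} to `out` as CSV, one row per file."""
    # Collect every column name seen, keeping first-seen order.
    fieldnames = ['Path']
    for details in rows.values():
        for name in details:
            if name not in fieldnames:
                fieldnames.append(name)

    # restval fills in columns that a given file does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='')
    writer.writeheader()
    for path, details in rows.items():
        writer.writerow({'Path': path, **details})


# Hypothetical sample data standing in for the values printed by the loop above.
sample = {
    r'm:\music\a.mp3': {'Name': 'a.mp3', 'Size': '3.1 MB'},
    r'm:\music\b.mp3': {'Name': 'b.mp3', 'Title': 'Song B'},
}
buffer = io.StringIO()
write_metadata_csv(sample, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a file opened with newline='' as the csv module documentation recommends.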

score 4 · Accepted Answer

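I decided to write my own answer as an attempt to combine and clarify the answers above (which helped me a great deal in solving my problem).

I would say there are two ways to approach this problem.

Case 1: you know which metadata the file contains (that is, which metadata you are interested in).

In this case, suppose you have a list of strings naming the metadata you are interested in. I assume here that the tags are appropriate (for example, you are not asking for the pixel count of a .txt file).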

metadata = ['Name', 'Size', 'Item type', 'Date modified', 'Date created']

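Now, using the code provided by Greedo and Roger Upole, I wrote a function that takes the file's folder path and its name as separate arguments and returns a dictionary containing the metadata of interest: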

import win32com.client  # Requires "pip install pywin32"


def get_file_metadata(path, filename, metadata):
    # path must not end with a backslash, e.g. r"E:\Images\Paris"
    # filename must include the extension, e.g. "PID manual.pdf"
    # Returns a dictionary with the non-empty metadata of the file.
    sh = win32com.client.gencache.EnsureDispatch('Shell.Application', 0)
    ns = sh.NameSpace(path)

    # ns.GetDetailsOf only accepts an integer column index as its 2nd argument,
    # so each name is paired with its position in the list. The list must
    # therefore follow the folder's column order, starting at column 0.
    file_metadata = dict()
    item = ns.ParseName(str(filename))
    for ind, attribute in enumerate(metadata):
        attr_value = ns.GetDetailsOf(item, ind)
        if attr_value:
            file_metadata[attribute] = attr_value

    return file_metadata

# Note: you must know the full path to the file.
# Example usage:
if __name__ == '__main__':
    folder = r'E:\Docs\BMW'  # raw string, so the backslashes are kept literally
    filename = 'BMW series 1 owners manual.pdf'
    metadata = ['Name', 'Size', 'Item type', 'Date modified', 'Date created']
    print(get_file_metadata(folder, filename, metadata))

Result:

{'Name': 'BMW series 1 owners manual.pdf', 'Size': '11.4 MB', 'Item type': 'Foxit Reader PDF Document', 'Date modified': '8/30/2020 11:10 PM', 'Date created': '8/30/2020 11:10 PM'}

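This is correct, because I had just created the file and I use Foxit PDF Reader as my main PDF reader. So this function returns a dictionary in which the keys are the metadata tags and the values are the values of those tags for the given file.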
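
Note that the function pairs each name in metadata with its position in the list, so the values are only right when the list follows the folder's column order from index 0. A rough sketch of looking columns up by name instead, under the assumption that the folder object behaves like the Shell one (GetDetailsOf(None, i) returns the column name). FakeFolder is a hypothetical stand-in so the example runs without Windows, and the 512 upper bound is borrowed from another answer on this page:

```python
class FakeFolder:
    """Hypothetical stand-in for the Shell folder object, for illustration only."""

    def __init__(self, headers, values):
        self._headers = headers  # {column index: column name}
        self._values = values    # {column index: value for the one fake item}

    def GetDetailsOf(self, item, index):
        # Like the Shell method: item=None asks for the column name.
        source = self._headers if item is None else self._values
        return source.get(index, '')


def column_indexes(folder, max_columns=512):
    """Map column name -> index, tolerating gaps in the numbering."""
    indexes = {}
    for i in range(max_columns):
        name = folder.GetDetailsOf(None, i)
        if name and name not in indexes:
            indexes[name] = i
    return indexes


def get_metadata_by_name(folder, item, wanted):
    """Return {name: value} for the requested names that have a value."""
    indexes = column_indexes(folder)
    result = {}
    for name in wanted:
        if name in indexes:
            value = folder.GetDetailsOf(item, indexes[name])
            if value:
                result[name] = value
    return result


folder = FakeFolder(
    headers={0: 'Name', 1: 'Size', 3: 'Date modified', 20: 'Authors'},
    values={0: 'manual.pdf', 1: '11.4 MB', 20: 'kemal'},
)
print(get_metadata_by_name(folder, object(), ['Authors', 'Name', 'Title']))
```

With the real Shell object, folder would be the ns value and item the result of ns.ParseName; the order of the wanted list then no longer matters.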

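Case 2: you do not know which metadata the file contains

This is a tougher case, especially in terms of optimality. I analyzed the code proposed by Roger Upole: basically, he reads the metadata of a None item, which gives him a list of all possible metadata tags. So I thought it might be easier to hard-code that list and then try to read every tag. That way, once you are done, you have a dictionary containing all the tags the file actually has.

Simply copy what I believe are all the possible metadata tags and try to read every one of them from the file. In other words, copy the declaration of this Python list and use the code above, replacing metadata with the new list: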

metadata = ['Name', 'Size', 'Item type', 'Date modified', 'Date created', 'Date accessed', 'Attributes', 'Offline status', 'Availability', 'Perceived type', 'Owner', 'Kind', 'Date taken', 'Contributing artists', 'Album', 'Year', 'Genre', 'Conductors', 'Tags', 'Rating', 'Authors', 'Title', 'Subject', 'Categories', 'Comments', 'Copyright', '#', 'Length', 'Bit rate', 'Protected', 'Camera model', 'Dimensions', 'Camera maker', 'Company', 'File description', 'Masters keywords', 'Masters keywords']

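I do not think this is a great solution, but on the other hand you can keep this list as a global variable and use it without passing it to every function call. For completeness, here is the output of the previous function with this new metadata list: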

{'Name': 'BMW series 1 owners manual.pdf', 'Size': '11.4 MB', 'Item type': 'Foxit Reader PDF Document', 'Date modified': '8/30/2020 11:10 PM', 'Date created': '8/30/2020 11:10 PM', 'Date accessed': '8/30/2020 11:10 PM', 'Attributes': 'A', 'Perceived type': 'Unspecified', 'Owner': 'KEMALS-ASPIRE-E\\kemal', 'Kind': 'Document', 'Rating': 'Unrated'}

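As you can see, the returned dictionary now contains all the metadata the file has. This works because of the if statement:

if attr_value:

This means that whenever an attribute is empty (None or an empty string), it is not added to the returned dictionary.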
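
A small illustration of that filter, using made-up values rather than real Shell output; both None and the empty string are falsy, so both are dropped:

```python
# Hypothetical raw values, as if read column by column for one file.
raw = {'Name': 'manual.pdf', 'Album': '', 'Owner': None, 'Rating': 'Unrated'}

# Same truthiness test as the if statement in the function: falsy values are skipped.
kept = {name: value for name, value in raw.items() if value}
print(kept)
```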

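I would stress that when many files are processed, it is better to declare the list as a global/static variable than to pass it to the function every time.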

score 2 · Accepted Answer

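The problem is that Windows stores file metadata in two ways. The approach you are using suits files created by COM applications; that data is stored inside the file itself. With the introduction of NTFS 5, however, any file can carry metadata as part of an alternate data stream. So I suspect the files that succeed are ones created by COM applications, and the ones that fail are not.

Here is a possibly more robust way of handling files created by COM applications: Get document summary information from any file.

Alternate data streams can be read directly: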

meta = open('myfile.ext:StreamName').read()
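
A slightly fuller sketch of the same idea, assuming Windows and an NTFS volume (elsewhere the colon is just part of an ordinary file name). The helper names are hypothetical, and the stream is read as bytes because its content is not necessarily text:

```python
def stream_path(filename, stream_name):
    """Build the 'file:stream' path that NTFS uses for an alternate data stream."""
    return f'{filename}:{stream_name}'


def read_stream(filename, stream_name):
    """Read a named stream as bytes. Assumes Windows and an NTFS volume."""
    with open(stream_path(filename, stream_name), 'rb') as handle:
        return handle.read()


print(stream_path('myfile.ext', 'StreamName'))
```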

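Update: OK, now I think none of this is relevant, because you are after document metadata rather than file metadata. What a difference clarity in a question can make :|

Try this: How to retrieve the author of an office file in Python?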

score 0 · Accepted Answer

Roger Upole's answer was very helpful. However, I also needed to read the "Last Saved By" detail of ".xls" files.

XLS file attributes can be read with win32com. The Workbook object has BuiltinDocumentProperties. https://gist.github.com/justengel/87bac3355b1a925288c59500d2ce6ef5

import os
import win32com.client  # Requires "pip install pywin32"


__all__ = ['get_xl_properties', 'get_file_details']


# https://docs.microsoft.com/en-us/dotnet/api/microsoft.office.tools.excel.workbook.builtindocumentproperties?view=vsto-2017
BUILTIN_XLS_ATTRS = ['Title', 'Subject', 'Author', 'Keywords', 'Comments', 'Template', 'Last Author', 'Revision Number',
                     'Application Name', 'Last Print Date', 'Creation Date', 'Last Save Time', 'Total Editing Time',
                     'Number of Pages', 'Number of Words', 'Number of Characters', 'Security', 'Category', 'Format',
                     'Manager', 'Company', 'Number of Bytes', 'Number of Lines', 'Number of Paragraphs',
                     'Number of Slides', 'Number of Notes', 'Number of Hidden Slides', 'Number of Multimedia Clips',
                     'Hyperlink Base', 'Number of Characters (with spaces)']


def get_xl_properties(filename, xl=None):
    """Return the known XLS file attributes for the given .xls filename."""
    quit = False
    if xl is None:
        xl = win32com.client.DispatchEx('Excel.Application')
        quit = True

    # Open the workbook
    wb = xl.Workbooks.Open(filename)

    # Save the attributes in a dictionary
    attrs = {}
    for attrname in BUILTIN_XLS_ATTRS:
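        try:
            val = wb.BuiltinDocumentProperties(attrname).Value
            if val:
                attrs[attrname] = val
        except Exception:  # Property missing or unreadable for this workbook
            pass

    # Close the workbook without saving so it does not stay open in Excel
    wb.Close(False)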

    # Quit the excel application
    if quit:
        try:
            xl.Quit()
            del xl
        except:
            pass

    return attrs


def get_file_details(directory, filenames=None):
    """Collect the attributes of a file or a list of files.
    Args:
        directory (str): Directory or filename to get attributes for
        filenames (str/list/tuple): If the given directory is a directory then a filename or list of files must be given
    Returns:
         file_attrs (dict): Dictionary of {filename: {attribute_name: value}} or dictionary of {attribute_name: value}
            if a single file is given.
    """
    if os.path.isfile(directory):
        directory, filenames = os.path.dirname(directory), [os.path.basename(directory)]
    elif filenames is None:
        filenames = os.listdir(directory)
    elif not isinstance(filenames, (list, tuple)):
        filenames = [filenames]

    if not os.path.exists(directory):
        raise ValueError('The given directory does not exist!')

    # Open the com object
    sh = win32com.client.gencache.EnsureDispatch('Shell.Application', 0)  # Builds and caches makepy wrappers on first use
    ns = sh.NameSpace(os.path.abspath(directory))

    # Get the directory file attribute column names
    cols = {}
    for i in range(512):  # 308 seemed to be max for excel file
        attrname = ns.GetDetailsOf(None, i)
        if attrname:
            cols[i] = attrname

    # Get the information for the files.
    files = {}
    for file in filenames:
        item = ns.ParseName(os.path.basename(file))
        files[os.path.abspath(item.Path)] = attrs = {}  # Store attributes in dictionary

        # Save attributes
        for i, attrname in cols.items():
            attrs[attrname] = ns.GetDetailsOf(item, i)

        # For xls file save special properties
        if os.path.splitext(file)[-1].lower() == '.xls':
            xls_attrs = get_xl_properties(item.Path)
            attrs.update(xls_attrs)

    # Clean up the com object
    try:
        del sh
    except:
        pass

    if len(files) == 1:
        return files[list(files.keys())[0]]
    return files


if __name__ == '__main__':
    import argparse

    P = argparse.ArgumentParser(description="Read and print file details.")
    P.add_argument('filename', type=str, help='Filename to read and print the details for.')
    P.add_argument('-v', '--show-empty', action='store_true', help='If given print keys with empty values.')
    ARGS = P.parse_args()

    # Argparse Variables
    FILENAME = ARGS.filename
    SHOW_EMPTY = ARGS.show_empty
    DETAILS = get_file_details(FILENAME)

    print(os.path.abspath(FILENAME))
    for k, v in DETAILS.items():
        if v or SHOW_EMPTY:
            print('\t', k, '=', v)

score 0 · Accepted Answer

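The Windows API Code Pack can be used with Python for .NET to read and write file metadata.

Download the NuGet packages for WindowsAPICodePack-Core and WindowsAPICodePack-Shell.
Extract the .nupkg files with a compression utility such as 7-Zip, either into the script's path or somewhere defined in the system path variable.
Install Python for .NET with pip install pythonnet.

Sample code to get and set the title of an MP4 video: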

import clr
clr.AddReference("Microsoft.WindowsAPICodePack")
clr.AddReference("Microsoft.WindowsAPICodePack.Shell")
from Microsoft.WindowsAPICodePack.Shell import ShellFile

# create shell file object
f = ShellFile.FromFilePath(r'movie.mp4')

# read video title
print(f.Properties.System.Title.Value)

# set video title
f.Properties.System.Title.Value = 'My video'

Hack to check the available properties:

dir(f.Properties.System)
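
The raw dir() output also includes private and special names. A minimal sketch of trimming it down to the public ones; FakeProperties is a hypothetical stand-in for f.Properties.System so the example runs without pythonnet:

```python
class FakeProperties:
    """Hypothetical stand-in for f.Properties.System, for illustration only."""
    Title = 'My video'
    Author = None
    _internal = 'hidden'


def public_names(obj):
    """Return the attribute names from dir() that do not start with '_'."""
    return [name for name in dir(obj) if not name.startswith('_')]


print(public_names(FakeProperties()))
```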
