python - Verify that an uploaded file is a Word document in Python

Question

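In my web application (Flask), I let users upload a Word document.

I check that the file extension is either .doc or .docx. However, I changed the extension of a .jpg file to .docx and it passed as well (as I expected).

Is there a way to verify that an uploaded file really is a Word document? I searched and read about file headers, but could not find anything more.

I'm using boto to upload the files to AWS, in case it matters. Thanks.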

score 2 · Accepted Answer

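Well, the python-magic library from the question linked in the comments looks like a pretty straightforward solution.

Nevertheless, I'll give a more manual option. According to this site, DOC files have the signature D0 CF 11 E0 A1 B1 1A E1 (8 bytes), while DOCX files have the signature 50 4B 03 04 (4 bytes). Both are at offset 0. It's safe to assume the files are little-endian since they come from Microsoft (though Office files might be big-endian on Macs? I'm not sure). In any case, byte order makes no difference when the signature is compared one byte at a time.

You can use the struct module to unpack binary data, like so: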

>>> import struct
>>> with open("foo.doc", "rb") as h:
...    buf = h.read()
...
>>> byte = struct.unpack_from("<B", buf, 0)[0]
>>> print("{0:x}".format(byte))
d0

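So here we unpack the first little-endian ("<") byte ("B") from the buffer holding the binary data read from the file, at offset 0, and we find "D0", the first byte of a DOC file. If we set the offset to 1, we get CF, the second byte.

Let's check whether it really is a DOC file: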

import struct

def is_doc(file):
    # Only the first 8 bytes are needed; the signature sits at offset 0.
    with open(file, 'rb') as h:
        buf = h.read(8)
    fingerprint = []
    if len(buf) >= 8:
        for i in range(8):
            byte = struct.unpack_from("<B", buf, i)[0]
            # Zero-pad so that a byte such as 0x0A is rendered as "0A", not "A".
            fingerprint.append("{0:02X}".format(byte))
    return ' '.join(fingerprint) == "D0 CF 11 E0 A1 B1 1A E1"

>>> is_doc("foo.doc")
True

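Unfortunately, I don't have a DOCX file to test with, but the process should be the same, except that you only take the first 4 bytes and compare them against the other fingerprint.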
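For the untested DOCX case, here is a minimal sketch of the same idea that compares raw bytes directly instead of formatting them as hex strings. The names `has_signature`, `is_docx`, `DOC_SIGNATURE` and `DOCX_SIGNATURE` are made up for this example; the signature values are the ones quoted above.

```python
# Signatures quoted in the answer above; both are found at offset 0.
DOC_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy .doc
DOCX_SIGNATURE = bytes.fromhex("504B0304")         # .docx (ZIP local file header)


def has_signature(path, signature):
    """Return True if the file at `path` starts with `signature`."""
    with open(path, "rb") as h:
        return h.read(len(signature)) == signature


def is_docx(path):
    # Caveat: ordinary ZIP archives start with the same 4 bytes,
    # so this only rules out files that are clearly not DOCX.
    return has_signature(path, DOCX_SIGNATURE)
```

A renamed .jpg fails this check, but a renamed .zip would still pass, so it is best treated as a first filter rather than proof.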

score 1 · Accepted Answer

Docx files are actually zip files. The zip contains three basic folders: word, docProps and _rels. So use zipfile to test whether those folders exist in the file.

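import zipfile

def isdir(z, name):
    # True if any archive member lives under the folder `name`.
    return any(x.startswith("%s/" % name.rstrip("/")) for x in z.namelist())

def isValidDocx(filename):
    try:
        # The context manager closes the archive again.
        with zipfile.ZipFile(filename, "r") as f:
            return isdir(f, "word") and isdir(f, "docProps") and isdir(f, "_rels")
    except zipfile.BadZipFile:
        # Not a ZIP archive at all, so it cannot be a DOCX.
        return False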

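Code adapted from "Check if a directory exists in a zip file with Python".

However, any ZIP that contains these folders will bypass the validation. I also don't know whether it works for DOC or encrypted DOCX files.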
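A slightly stricter variant of the same idea is to look for specific members rather than folders. The sketch below is only a heuristic: the function name `looks_like_docx` is hypothetical, and it assumes the main document part is stored at the conventional `word/document.xml` path, which typical DOCX files use. It accepts a file-like object, which suits an upload that has not been written to disk.

```python
import io
import zipfile


def looks_like_docx(fileobj):
    """Heuristic: a ZIP archive containing the parts a typical DOCX has."""
    try:
        with zipfile.ZipFile(fileobj) as z:
            names = set(z.namelist())
    except zipfile.BadZipFile:
        # Not a ZIP archive at all.
        return False
    # Assumption: the main part sits at the conventional word/document.xml.
    return "[Content_Types].xml" in names and "word/document.xml" in names


# Build a tiny in-memory archive to exercise the check.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<document/>")
buf.seek(0)
print(looks_like_docx(buf))
```

A deliberately crafted ZIP can still pass, so this narrows the check without making it airtight.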

score 1 · Accepted Answer

You can use the python-docx library.

The code below raises a ValueError if the file is not a docx file.

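from docx import Document
from docx.opc.exceptions import PackageNotFoundError

try:
    Document("abc.docx")
except (ValueError, PackageNotFoundError):
    # Assumption: a file that is not a ZIP package at all surfaces as
    # PackageNotFoundError rather than ValueError.
    print("Not a valid document type")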

score 0 · Accepted Answer

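I used python-magic to verify whether the file type is a Word document. However, I ran into a lot of problems, such as different Word versions or different software producing different types. So I gave up on python-magic.

Here is my solution.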

import struct

DOC_MAGIC_BYTES = [
    "D0 CF 11 E0 A1 B1 1A E1",
    "0D 44 4F 43",
    "CF 11 E0 A1 B1 1A E1 00",
    "DB A5 2D 00",
    "EC A5 C1 00"
]
DOCX_MAGIC_BYTES = [
    "50 4B 03 04"
]

def validate_is_word(content):
    magic_bytes = content[:8]
    fingerprint = []
    bytes_len = len(magic_bytes)
    if bytes_len >= 4:
        for i in range(bytes_len):
            byte = struct.unpack_from("<B", magic_bytes, i)[0]
            fingerprint.append("{:02x}".format(byte))
    if not fingerprint:
        return False
    if is_docx_file(fingerprint):
        return True
    if is_doc_file(fingerprint):
        return True
    return False


def is_doc_file(magic_bytes):
    four_bytes = " ".join(magic_bytes[:4]).upper()
    all_bytes = " ".join(magic_bytes).upper()
    return four_bytes in DOC_MAGIC_BYTES or all_bytes in DOC_MAGIC_BYTES


def is_docx_file(magic_bytes):
    type_ = " ".join(magic_bytes[:4]).upper()
    return type_ in DOCX_MAGIC_BYTES

You can give this a try.

score 0 · Accepted Answer

I use the filetype Python library to check the MIME type and compare it with the document's extension, so my users can't fool me just by changing their file extension.

pip install filetype

Then

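import filetype

kind = filetype.guess('path/to/file')
if kind is None:
    # guess() returns None when the type cannot be determined.
    print('Cannot guess file type')
else:
    mime = kind.mime
    ext = kind.extension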

You can check their documentation here.
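The comparison step itself is not shown above, so here is a rough sketch of one way to do it. The helper name `extension_matches` is hypothetical, and the detected extension is passed in as a plain string (for example the `kind.extension` value from above, or None when detection failed) so the snippet runs without any third-party package.

```python
import os


def extension_matches(filename, detected_extension):
    """Compare the extension the user claims with the one detected from content."""
    # Strip the leading dot and ignore case: "Report.DOCX" -> "docx".
    claimed = os.path.splitext(filename)[1].lstrip(".").lower()
    return detected_extension is not None and claimed == detected_extension.lower()


print(extension_matches("report.DOCX", "docx"))  # True
print(extension_matches("photo.docx", "jpg"))    # False
```

Rejecting the upload whenever this returns False covers the renamed .jpg case from the question.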

score 0 · Accepted Answer

python-magic does a good job of detecting the docx and pptx formats.

Here are some examples:

In [60]: magic.from_file("oz123.docx")
Out[60]: 'Microsoft Word 2007+'

In [61]: magic.from_file("oz123.docx", mime=True)
Out[61]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

In [62]: magic.from_file("presentation.pptx")
Out[62]: 'Microsoft PowerPoint 2007+'

In [63]: magic.from_file("presentation.pptx", mime=True)
Out[63]: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'

Since the OP asked about file uploads, detecting from a file on disk isn't very useful. Luckily, magic supports detection from a buffer:

In [63]: fdox
Out[63]: <_io.BufferedReader name='/home/oz123/Documents/oz123.docx'>

In [64]: magic.from_buffer(fdox.read(2048))
Out[64]: 'Zip archive data, at least v2.0 to extract'

Naively, we read too few bytes... Reading more bytes solves the problem:

In [65]: fdox.seek(0)
Out[65]: 0

In [66]: magic.from_buffer(fdox.read(4096))
Out[66]: 'Microsoft Word 2007+'

In [67]: fdox.seek(0)
Out[67]: 0

In [68]: magic.from_buffer(fdox.read(4096), mime=True)
Out[68]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
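The read-then-`seek(0)` pattern in the session above can be wrapped in a small helper so the upload stream is left where it was before the file is passed on to storage. This is only a sketch: `read_head` is a hypothetical name, the 4096 default simply mirrors the session above (how many bytes are actually needed may vary by file), and an `io.BytesIO` stands in for the uploaded file so the snippet runs without python-magic.

```python
import io


def read_head(stream, size=4096):
    """Read up to `size` bytes from a seekable stream, then restore its position."""
    start = stream.tell()
    head = stream.read(size)
    stream.seek(start)  # leave the stream untouched for later consumers
    return head


# In-memory stand-in for an upload; a real call might look like
# magic.from_buffer(read_head(upload), mime=True).
upload = io.BytesIO(b"PK\x03\x04" + b"\x00" * 10000)
head = read_head(upload)
print(len(head), upload.tell())
```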
