python - wkhtmltopdf generates a different checksum on every run

Question

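I am trying to verify that the output generated by wkhtmltopdf is identical from run to run, but every time I run wkhtmltopdf I get a different hash/checksum value for the same page. This happens with something really basic, such as the following HTML page: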

<html>
<body>
<p> This is some text</p>
</body>
</html>

Every time I run wkhtmltopdf with the following simple command, I get a different md5 or sha256 hash:

./wkhtmltopdf example.html ~/Documents/a.pdf

and hash the result with the following Python:

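import hashlib

def shasum(filename):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(filename, 'rb') as f:
        # Read in multiples of the hash block size until EOF (b'')
        for chunk in iter(lambda: f.read(128 * sha.block_size), b''):
            sha.update(chunk)
    return sha.hexdigest()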

or the md5 version, which simply swaps md5 in for sha256.
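For reference, the md5/sha256 swap could be expressed as a single function that takes the algorithm name. This is only a sketch: `file_digest` and its `algorithm` parameter are names made up here, not something from the original code, and it relies on `hashlib.new` accepting names such as "md5" and "sha256".

```python
import hashlib

def file_digest(filename, algorithm="sha256"):
    """Return the hex digest of a file using the named hashlib algorithm."""
    h = hashlib.new(algorithm)  # e.g. "md5" or "sha256"
    with open(filename, "rb") as f:
        # Same chunked read as shasum above, so large PDFs are not loaded at once
        for chunk in iter(lambda: f.read(128 * h.block_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Calling `file_digest(path, "md5")` and `file_digest(path)` on the same PDF would then give the two hashes mentioned in the question.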

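Why does wkhtmltopdf generate a file that differs enough to produce a different checksum, and is there any way to stop it from doing so? Is there a command-line option that prevents this?

I have tried --default-header, --no-pdf-compression, and --disable-smart-shrinking.

This is on Mac OS X, but I have also generated these PDFs on other machines and downloaded them, with the same result.

wkhtmltopdf version = 0.10.0 rc2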

score 2 · Accepted Answer

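I tried this and opened the generated PDF in emacs. wkhtmltopdf embeds a "/CreationDate" field in the PDF. It is different on every run, which changes the hash from one run to the next.

I don't see an option to disable the "/CreationDate" field, but it would be simple to strip it from the file before computing the hash.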
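Since the question uses Python, here is a minimal sketch of stripping the field before hashing. It assumes the date is stored uncompressed as a "/CreationDate (D:...)" entry, which is the format reported in this thread; `stable_sha256` is a hypothetical helper name, and this has not been checked against every wkhtmltopdf version.

```python
import hashlib
import re

# Assumption: the date appears as plain text in the form "/CreationDate (D:...)".
CREATION_DATE = re.compile(rb"/CreationDate\s*\(D:[^)]*\)")

def stable_sha256(pdf_bytes):
    """Hash PDF bytes after removing the first /CreationDate entry."""
    normalized = CREATION_DATE.sub(b"", pdf_bytes, count=1)
    return hashlib.sha256(normalized).hexdigest()
```

Something like `stable_sha256(open(path, "rb").read())` on two runs of the same page should then agree, provided the creation date is the only thing that differs between them.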

score 1 · Accepted Answer

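I wrote a method that copies the creation date from the expected output into the freshly generated file. It is in Ruby, and the arguments can be anything that walks and quacks like an IO: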

def copy_wkhtmltopdf_creation_date(to, from)
  to_current_pos, from_current_pos = [to.pos, from.pos]
  to.pos = from.pos = 74
  to.write(from.read(14))
  to.pos, from.pos = [to_current_pos, from_current_pos]
end

score 0 · Accepted Answer

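Inspired by Carlos, I wrote a solution that does not use a hardcoded index, because in my documents the index differs from Carlos's 74.

Also, I don't keep the files open, and I handle the case where no CreationDate is found by returning early.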

def copy_wkhtmltopdf_creation_date(to, from)
  index, date = File.foreach(from).reduce(0) do |acc, line|
    if line.index("CreationDate")
      break [acc + line.index(/\d{14}/), $~[0]]
    else
      acc + line.bytesize
    end
  end

  if date # IE, yes this is a wkhtmltopdf document
    File.open(to, "r+") do |to|
      to.pos = index
      to.write(date)
    end
  end
end
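The same idea, locating the date by searching instead of using a fixed offset, might look like this in Python. It is a sketch that works on bytes already read into memory and assumes a 14-digit timestamp follows "/CreationDate (D:" in both files; `copy_creation_date` is a hypothetical name, not a port of the Ruby method line for line.

```python
import re

# Assumption: a 14-digit timestamp follows "/CreationDate (D:" in both files.
DATE = re.compile(rb"/CreationDate\s*\(D:(\d{14})")

def copy_creation_date(generated, expected):
    """Return `generated` with its timestamp replaced by the one in `expected`."""
    src = DATE.search(expected)
    dst = DATE.search(generated)
    if src is None or dst is None:
        return generated  # no CreationDate found; leave the bytes untouched
    start, end = dst.span(1)  # byte range of the 14 digits in `generated`
    return generated[:start] + src.group(1) + generated[end:]
```

After this, the generated bytes can be compared or hashed against the expected output directly.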

score 0 · Accepted Answer

We solved this by stripping the creation date with a simple regular expression.

preg_replace("/\\/CreationDate \\(D:.*\\)\\n/uim", "", $file_contents, 1);

After doing this, we get a consistent checksum every time.
