amazon-s3 - What is the algorithm to compute the Amazon S3 ETag for a file larger than 5GB?

Question

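Files uploaded to Amazon S3 that are smaller than 5GB have an ETag that is simply the MD5 hash of the file, which makes it easy to check whether your local files are the same as what you put on S3.

But if your file is larger than 5GB, Amazon computes the ETag differently.

For example, I did a multipart upload of a 5,970,150,664 byte file in 380 parts. S3 now shows its ETag as 6bcf86bed8807b8e78f0fc6e0a53079d-380. My local file has an md5 hash of 702242d3703818ddefe6bf7da2bed757. I think the number after the dash is the number of parts in the multipart upload.

I also suspect that the new ETag (the part before the dash) is still an MD5 hash, but with some metadata from the multipart upload mixed in along the way.

Does anyone know how to compute the ETag using the same algorithm as Amazon S3?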

score 98 · Accepted Answer

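Say you uploaded a 14MB file to a bucket without server-side encryption, and your part size is 5MB. Calculate 3 MD5 checksums, one for each part: the checksum of the first 5MB, of the second 5MB, and of the last 4MB. Then take the checksum of their concatenation. MD5 checksums are usually printed as a hex representation of binary data, so make sure you take the MD5 of the decoded binary concatenation, not of the ASCII or UTF-8 encoded concatenation. When that is done, append a hyphen and the number of parts to get the ETag.

Here are the commands to do it on Mac OS X from the console: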

$ dd bs=1m count=5 skip=0 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019611 secs (267345449 bytes/sec)
$ dd bs=1m count=5 skip=5 if=someFile | md5 >>checksums.txt
5+0 records in
5+0 records out
5242880 bytes transferred in 0.019182 secs (273323380 bytes/sec)
$ dd bs=1m count=5 skip=10 if=someFile | md5 >>checksums.txt
2+1 records in
2+1 records out
2599812 bytes transferred in 0.011112 secs (233964895 bytes/sec)

At this point all the checksums are in checksums.txt. To concatenate them, decode the hex, and get the MD5 checksum of the lot, just use

$ xxd -r -p checksums.txt | md5

Now append "-3" to get the ETag, since there were 3 parts.
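For readers who would rather see the same steps outside the shell, here is a minimal Python sketch of the three-part computation on in-memory data. The names `multipart_etag` and `wrong_etag` and the zero-filled sample are made up for illustration. The point it tries to show is the one stressed above: the per-part digests are joined as raw bytes, not as hex text. Note that this sketch always returns the hyphenated form, even for a single part.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Hypothetical helper: MD5 of the concatenated *binary* part digests."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    # digest() returns the raw 16 bytes; hexdigest() would return 32 ASCII characters.
    binary_digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return "{}-{}".format(hashlib.md5(binary_digests).hexdigest(), len(parts))


def wrong_etag(data: bytes, part_size: int) -> str:
    """The common mistake: hashing the hex text of the digests instead."""
    parts = [data[i:i + part_size] for i in range(0, len(data), part_size)]
    hex_text = "".join(hashlib.md5(p).hexdigest() for p in parts)
    return "{}-{}".format(hashlib.md5(hex_text.encode("ascii")).hexdigest(), len(parts))


MB = 1024 * 1024
sample = bytes(14 * MB)  # 14MB of zero bytes standing in for a real file

print(multipart_etag(sample, 5 * MB))  # three parts, so the suffix is "-3"
print(wrong_etag(sample, 5 * MB))      # same suffix, but a different hash
```

If a script produces an ETag with the right suffix but the wrong hash, hashing the hex text instead of the decoded bytes is a likely cause.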

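Notes

If you uploaded with aws-cli via aws s3 cp, then you most likely have an 8MB chunk size. According to the docs, that is the default.
If the bucket has server-side encryption (SSE) turned on, the ETag won't be the MD5 checksum (see the API documentation). But if you are just trying to verify that an uploaded part matches what you sent, you can use the Content-MD5 header and S3 will compare it for you.
md5 on macOS writes out only the checksum, but md5sum on Linux/brew also outputs the filename. You'll need to strip that, though I'm sure there is some option to output only the checksums. You don't need to worry about whitespace, because xxd will ignore it.

Code links

A gist I wrote with a working script for macOS.
The project at s3md5.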

score 24 · Accepted Answer

Based on the answers here, I wrote a Python implementation that correctly calculates both multipart and single-part file ETags.

import hashlib


def calculate_s3_etag(file_path, chunk_size=8 * 1024 * 1024):
    md5s = []

    with open(file_path, 'rb') as fp:
        while True:
            data = fp.read(chunk_size)
            if not data:
                break
            md5s.append(hashlib.md5(data))

    if len(md5s) < 1:
        return '"{}"'.format(hashlib.md5().hexdigest())

    if len(md5s) == 1:
        return '"{}"'.format(md5s[0].hexdigest())

    digests = b''.join(m.digest() for m in md5s)
    digests_md5 = hashlib.md5(digests)
    return '"{}-{}"'.format(digests_md5.hexdigest(), len(md5s))

The default chunk_size is 8 MB, the value used by the official aws cli tool, which does a multipart upload for 2+ chunks. It should work under both Python 2 and 3.

score 12 · Accepted Answer

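bash implementation

python implementation

The algorithm literally is (copied from the readme in the python implementation):

md5 the chunks
glob the md5 strings together
convert the glob to binary
md5 the binary of the globbed chunk md5s
append "-Number_of_chunks" to the end of the md5 string of the binary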

score 9 · Accepted Answer

Same algorithm, Java version (BaseEncoding, Hasher, Hashing, etc. come from the guava library):

/**
 * Generate the checksum for an object created by a multipart upload.
 * <p>
 * AWS S3 spec: Entity tag that identifies the newly created object's data. Objects with different object data will have different entity tags. The entity tag is an opaque string. The entity tag may or may not be an MD5 digest of the object data. If the entity tag is not an MD5 digest of the object data, it will contain one or more nonhexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.</p> 
 * Algorithm follows AWS S3 implementation: https://github.com/Teachnova/s3md5</p>
 */
private static String calculateChecksumForMultipartUpload(List<String> md5s) {      
    StringBuilder stringBuilder = new StringBuilder();
    for (String md5:md5s) {
        stringBuilder.append(md5);
    }

    String hex = stringBuilder.toString();
    byte raw[] = BaseEncoding.base16().decode(hex.toUpperCase());
    Hasher hasher = Hashing.md5().newHasher();
    hasher.putBytes(raw);
    String digest = hasher.hash().toString();

    return digest + "-" + md5s.size();
}

score 9 · Accepted Answer

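Not sure if this helps:

We're currently using an ugly (but so far useful) hack to fix those wrong ETags on multipart-uploaded files. It consists of applying a change to the file in the bucket; that triggers an md5 recalculation on Amazon's side, which changes the ETag to match the actual md5 signature.

In our case:

File: bucket/Foo.mpg.gpg

ETag obtained: "3f92dfeff0a11d175e60fb8b958b4e6e-2"
Do something with the file (rename it, add metadata such as a fake header, and so on)
ETag obtained: "c1d903ca1bb6dc68778ef21e74cc15b0"

We don't know the algorithm, but since we can "fix" the ETag we don't need to worry about it either.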

score 6 · Accepted Answer

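Here's yet another piece of this crazy AWS challenge puzzle.

FWIW, this answer assumes you have already figured out how to calculate the "MD5 of MD5 parts" and can rebuild your AWS multipart ETag from all the other answers already provided here.

What this answer addresses is the annoyance of having to "guess" or otherwise "divine" the original upload part size.

We use several different tools for uploading to S3 and they all seem to have different upload part sizes, so "guessing" really wasn't an option. We also have a lot of files that were uploaded historically, when part sizes seemed to be different. And the old trick of using an internal server copy to force the creation of an MD5-type ETag no longer works either, because AWS has changed its internal server copies to use multipart as well (just with a fairly large part size).

So... how can you figure out the object's part size?

Well, if you first make a head_object request and detect that the ETag is a multipart-type ETag (it includes a '-<partcount>' at the end), then you can make another head_object request, this time with an additional part_number attribute of 1 (the first part). This follow-on head_object request returns the content_length of the first part. Voilà... now you know the part size that was used, and you can use that size to recreate your local ETag, which should match the original S3 ETag created when the object was uploaded.

Additionally, if you want to be exact (perhaps some multipart uploads use variable part sizes), you can keep issuing head_object requests with each part_number specified and calculate each part's MD5 using the returned content_length for that part.

Hope that helps...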
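As a rough illustration of the approach above, the sketch below splits the work in two. `fetch_part_sizes` is a hypothetical helper that calls boto3's `head_object` with `PartNumber` and reads `ContentLength` and `PartsCount` from the response; it needs credentials and network access, so it is untested here. `etag_from_part_sizes` is also a made-up name; it only needs the local file and a list of part sizes, so it also covers uploads whose parts are not all the same size.

```python
import hashlib


def fetch_part_sizes(bucket, key):
    """Hypothetical helper: ask S3 for the size of every part of an object."""
    import boto3  # third-party; imported lazily so the hashing helper works without it

    s3 = boto3.client("s3")
    first = s3.head_object(Bucket=bucket, Key=key, PartNumber=1)
    count = first.get("PartsCount", 1)  # absent for objects that are not multipart
    sizes = [first["ContentLength"]]
    for number in range(2, count + 1):
        part = s3.head_object(Bucket=bucket, Key=key, PartNumber=number)
        sizes.append(part["ContentLength"])
    return sizes


def _md5_of_next(fp, size, block=1024 * 1024):
    """MD5 digest of the next `size` bytes of fp, read in small blocks."""
    md5 = hashlib.md5()
    remaining = size
    while remaining > 0:
        data = fp.read(min(block, remaining))
        if not data:
            break  # file is shorter than the part sizes claim
        md5.update(data)
        remaining -= len(data)
    return md5.digest()


def etag_from_part_sizes(path, part_sizes):
    """Rebuild a multipart-style ETag from a local file and known part sizes."""
    with open(path, "rb") as fp:
        digests = [_md5_of_next(fp, size) for size in part_sizes]
    combined = hashlib.md5(b"".join(digests)).hexdigest()
    return "{}-{}".format(combined, len(part_sizes))
```

A caller would compare `etag_from_part_sizes(local_path, fetch_part_sizes(bucket, key))` with the remote ETag after stripping its surrounding double quotes.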

score 5 · Accepted Answer

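According to the AWS documentation, the ETag is not an MD5 hash for a multipart upload, nor for an encrypted object: http://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and encrypted by SSE-S3 or stored as plaintext, have ETags that are an MD5 digest of their object data.

Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.

If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.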
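Before comparing anything, it can help to sort an ETag into one of these cases. The sketch below is one way to do that; `classify_etag` and its three labels are invented for this example. Keep in mind the rule quoted above: a 32-hex-digit ETag is only possibly an MD5, because objects encrypted with SSE-C or SSE-KMS have ETags that are not the MD5 of their data.

```python
import re

_PLAIN_RE = re.compile(r"^[0-9a-f]{32}$")
_MULTIPART_RE = re.compile(r"^[0-9a-f]{32}-(\d+)$")


def classify_etag(etag):
    """Return ("md5", None), ("multipart", part_count) or ("opaque", None)."""
    # S3 returns the ETag wrapped in double quotes, so strip them first.
    value = etag.strip().strip('"').lower()
    if _PLAIN_RE.match(value):
        return ("md5", None)  # 32 hex digits: possibly the MD5 of the object data
    match = _MULTIPART_RE.match(value)
    if match:
        return ("multipart", int(match.group(1)))
    return ("opaque", None)


print(classify_etag('"6bcf86bed8807b8e78f0fc6e0a53079d-380"'))
print(classify_etag('"702242d3703818ddefe6bf7da2bed757"'))
```

The two sample values are the ETag and the local md5 from the question at the top of this thread.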

score 5 · Accepted Answer

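In an answer above, someone asked whether there is a way to get the md5 for files larger than 5G.

The answer I can give for getting the MD5 value (for files larger than 5G) is to either add it to the metadata manually, or do your uploads with a program that adds the information.

For example, I used s3cmd to upload a file, and it added the following metadata.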

$ aws s3api head-object --bucket xxxxxxx --key noarch/epel-release-6-8.noarch.rpm 
{
  "AcceptRanges": "bytes", 
  "ContentType": "binary/octet-stream", 
  "LastModified": "Sat, 19 Sep 2015 03:27:25 GMT", 
  "ContentLength": 14540, 
  "ETag": "\"2cd0ae668a585a14e07c2ea4f264d79b\"", 
  "Metadata": {
    "s3cmd-attrs": "uid:502/gname:staff/uname:xxxxxx/gid:20/mode:33188/mtime:1352129496/atime:1441758431/md5:2cd0ae668a585a14e07c2ea4f264d79b/ctime:1441385182"
  }
}

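It isn't a direct solution using the ETag, but it is a way to populate the metadata you want (the MD5) in a form you can access. It will still fail if someone uploads the file without the metadata.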
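To pull the md5 back out of that metadata, the `s3cmd-attrs` value shown above can be split on `/` and `:`. The sketch below assumes the format is exactly what the sample output shows (slash-separated `name:value` pairs); `md5_from_s3cmd_attrs` is a hypothetical helper name, not part of s3cmd.

```python
def md5_from_s3cmd_attrs(attrs):
    """Return the md5 field of an s3cmd-attrs string, or None if it is absent."""
    fields = {}
    for item in attrs.split("/"):
        if ":" in item:
            name, value = item.split(":", 1)
            fields[name] = value
    return fields.get("md5")


# The value copied from the head-object output above.
attrs = (
    "uid:502/gname:staff/uname:xxxxxx/gid:20/mode:33188/"
    "mtime:1352129496/atime:1441758431/"
    "md5:2cd0ae668a585a14e07c2ea4f264d79b/ctime:1441385182"
)
print(md5_from_s3cmd_attrs(attrs))
```

For this small, single-part object the extracted value is the same as the ETag in the output above, which is what you would expect.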

score 3 · Accepted Answer

Here's the algorithm in ruby...

require 'digest'

# PART_SIZE should match the chosen part size of the multipart upload
# Set here as 10MB
PART_SIZE = 1024*1024*10 

class File
  def each_part(part_size = PART_SIZE)
    yield read(part_size) until eof?
  end
end

file = File.new('<path_to_file>')

hashes = []

file.each_part do |part|
  hashes << Digest::MD5.hexdigest(part)
end

multipart_hash = Digest::MD5.hexdigest([hashes.join].pack('H*'))
multipart_etag = "#{multipart_hash}-#{hashes.count}"

Thanks to Shortest Hex2Bin in Ruby and Multipart Uploads to S3 ...

score 2 · Accepted Answer

Here's a PHP version that calculates the ETag:

function calculate_aws_etag($filename, $chunksize) {
    /*
    DESCRIPTION:
    - calculate Amazon AWS ETag used on the S3 service
    INPUT:
    - $filename : path to file to check
    - $chunksize : chunk size in Megabytes
    OUTPUT:
    - ETag (string)
    */
    $chunkbytes = $chunksize*1024*1024;
    if (filesize($filename) < $chunkbytes) {
        return md5_file($filename);
    } else {
        $md5s = array();
        $handle = fopen($filename, 'rb');
        if ($handle === false) {
            return false;
        }
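        while (!feof($handle)) {
            $buffer = fread($handle, $chunkbytes);
            if ($buffer === false || $buffer === '') {
                break; // nothing left to read; don't hash an empty trailing chunk
            }
            $md5s[] = md5($buffer);
            unset($buffer);
        }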
        fclose($handle);

        $concat = '';
        foreach ($md5s as $indx => $md5) {
            $concat .= hex2bin($md5);
        }
        return md5($concat) .'-'. count($md5s);
    }
}

$etag = calculate_aws_etag('path/to/myfile.ext', 8);

Here is an enhanced version that can verify against an expected ETag, and can even guess the chunk size if you don't know it!

function calculate_etag($filename, $chunksize, $expected = false) {
    /*
    DESCRIPTION:
    - calculate Amazon AWS ETag used on the S3 service
    INPUT:
    - $filename : path to file to check
    - $chunksize : chunk size in Megabytes
    - $expected : verify calculated etag against this specified etag and return true or false instead
        - if you make chunksize negative (eg. -8 instead of 8) the function will guess the chunksize by checking all possible sizes given the number of parts mentioned in $expected
    OUTPUT:
    - ETag (string)
    - or boolean true|false if $expected is set
    */
    if ($chunksize < 0) {
        $do_guess = true;
        $chunksize = 0 - $chunksize;
    } else {
        $do_guess = false;
    }

    $chunkbytes = $chunksize*1024*1024;
    $filesize = filesize($filename);
    if ($filesize < $chunkbytes && (!$expected || !preg_match("/^\\w{32}-\\w+$/", $expected))) {
        $return = md5_file($filename);
        if ($expected) {
            $expected = strtolower($expected);
            return ($expected === $return ? true : false);
        } else {
            return $return;
        }
    } else {
        $md5s = array();
        $handle = fopen($filename, 'rb');
        if ($handle === false) {
            return false;
        }
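        while (!feof($handle)) {
            $buffer = fread($handle, $chunkbytes);
            if ($buffer === false || $buffer === '') {
                break; // nothing left to read; don't hash an empty trailing chunk
            }
            $md5s[] = md5($buffer);
            unset($buffer);
        }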
        fclose($handle);

        $concat = '';
        foreach ($md5s as $indx => $md5) {
            $concat .= hex2bin($md5);
        }
        $return = md5($concat) .'-'. count($md5s);
        if ($expected) {
            $expected = strtolower($expected);
            $matches = ($expected === $return ? true : false);
            if ($matches || $do_guess == false || strlen($expected) == 32) {
                return $matches;
            } else {
                // Guess the chunk size
                preg_match("/-(\\d+)$/", $expected, $match);
                $parts = $match[1];
                $min_chunk = ceil($filesize / $parts /1024/1024);
                $max_chunk =  floor($filesize / ($parts-1) /1024/1024);
                $found_match = false;
                for ($i = $min_chunk; $i <= $max_chunk; $i++) {
                    if (calculate_aws_etag($filename, $i) === $expected) {
                        $found_match = true;
                        break;
                    }
                }
                return $found_match;
            }
        } else {
            return $return;
        }
    }
}

score 1 · Accepted Answer

A version in Rust:

use crypto::digest::Digest;
use crypto::md5::Md5;
use std::fs::File;
use std::io::prelude::*;
use std::io::Result; // assumed alias: File::open and read both fail with std::io::Error
use std::iter::repeat;

fn calculate_etag_from_read(f: &mut dyn Read, chunk_size: usize) -> Result<String> {
    let mut md5 = Md5::new();
    let mut concat_md5 = Md5::new();
    let mut input_buffer = vec![0u8; chunk_size];
    let mut chunk_count = 0;
    let mut current_md5: Vec<u8> = repeat(0).take((md5.output_bits() + 7) / 8).collect();

    let md5_result = loop {
        let amount_read = f.read(&mut input_buffer)?;
        if amount_read > 0 {
            md5.reset();
            md5.input(&input_buffer[0..amount_read]);
            chunk_count += 1;
            md5.result(&mut current_md5);
            concat_md5.input(&current_md5);
        } else {
            if chunk_count > 1 {
                break format!("{}-{}", concat_md5.result_str(), chunk_count);
            } else {
                break md5.result_str();
            }
        }
    };
    Ok(md5_result)
}

fn calculate_etag(file: &String, chunk_size: usize) -> Result<String> {
    let mut f = File::open(file)?;
    calculate_etag_from_read(&mut f, chunk_size)
}

See the repo for a simple implementation: https://github.com/bn3t/calculate-etag/tree/master

score 1 · Accepted Answer

node.js implementation -

const fs = require('fs');
const crypto = require('crypto');

const chunk = 1024 * 1024 * 5; // 5MB

const md5 = data => crypto.createHash('md5').update(data).digest('hex');

const getEtagOfFile = (filePath) => {
  const stream = fs.readFileSync(filePath);
  if (stream.length <= chunk) {
    return md5(stream);
  }
  const md5Chunks = [];
  const chunksNumber = Math.ceil(stream.length / chunk);
  for (let i = 0; i < chunksNumber; i++) {
    const chunkStream = stream.slice(i * chunk, (i + 1) * chunk);
    md5Chunks.push(md5(chunkStream));
  }

  return `${md5(Buffer.from(md5Chunks.join(''), 'hex'))}-${chunksNumber}`;
};

score 1 · Accepted Answer

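The short answer is that you take the 128-bit binary md5 digest of each part, concatenate them into a document, and hash that document. The algorithm presented in this answer is accurate.

Note: the multipart ETag form with the hyphen changes to the form without the hyphen if you "touch" the blob (even without modifying its content). That is, if you copy your completed multipart-uploaded object, or do an in-place copy of it (aka PUT-COPY), S3 recomputes the ETag with the simple version of the algorithm, i.e. the destination object will have an ETag without the hyphen.

You have probably considered this already, but if your files are smaller than 5GB, you already know their MD5s, and upload parallelization provides little to no benefit (e.g. you are streaming the upload from a slow network, or uploading from a slow disk), then you may also consider using a simple PUT instead of a multipart PUT and passing your known Content-MD5 in the request headers; Amazon will fail the upload if they don't match. Keep in mind that you are charged for each UploadPart.

Furthermore, in some clients, passing a known MD5 as input to a PUT operation saves the client from recomputing the MD5 during the transfer. In boto3 (python), for example, you would use the ContentMD5 parameter of the client.put_object() method. If you omit the parameter even though you already know the MD5, the client wastes cycles computing it again before the transfer.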
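One detail that is easy to miss: the Content-MD5 header carries the base64 encoding of the raw 16-byte digest, not the 32-character hex string that md5 tools print. A minimal sketch of building that value follows; `content_md5` is a name chosen for this example, and the `put_object` call in the comment is only an outline of how it might be passed to boto3.

```python
import base64
import hashlib


def content_md5(data):
    """Base64 of the binary MD5 digest, the form the Content-MD5 header expects."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")


def content_md5_from_hex(hex_digest):
    """Convert an already-known hex MD5 into the header form without rehashing."""
    return base64.b64encode(bytes.fromhex(hex_digest)).decode("ascii")


print(content_md5(b"hello world"))

# Outline only (requires boto3, credentials and a real bucket):
# client.put_object(Bucket="my-bucket", Key="my-key", Body=data,
#                   ContentMD5=content_md5(data))
```

The second helper fits the case described above, where the MD5 is already known and rehashing the data would be wasted work.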

score 0 · Accepted Answer

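I like Emerson's leading answer above, especially the xxd part, but I was too lazy to use dd, so I went with split, guessing at an 8M chunk size because I uploaded with aws s3 cp: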

$ split -b 8M large.iso XXX
$ md5sum XXX* > checksums.txt
$ sed -i 's/ .*$//' checksums.txt 
$ xxd -r -p checksums.txt | md5sum
99a090df013d375783f0f0be89288529  -
$ wc -l checksums.txt 
80 checksums.txt
$

Pretty clearly, both parts of my S3 etag match my file's calculated etag.

UPDATE:

This has been working nicely:

$ ll large.iso
-rw-rw-r-- 1 user   user   669134848 Apr 12  2021 large.iso
$ 
$ etag large.iso
99a090df013d375783f0f0be89288529-80
$ 
$ type etag
etag is a function
etag () 
{ 
    split -b 8M --filter=md5sum $1 | cut -d' ' -f1 | pee "xxd -r -p | md5sum | cut -d' ' -f1" "wc -l" | paste -d'-' - -
}
$

score 0 · Accepted Answer

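I just saw that the AWS S3 Console "upload" uses an unusual part (chunk) size of 17,179,870, at least for larger files.

Using that part size gave me the correct ETag hash with the methods described earlier. Thanks to @TheStoryCoder for the php version.

Thanks to @hans for the idea of using head-object to see the actual size of each part.

I used the AWS S3 Console (on Nov 28 2020) to upload about 50 files ranging in size from 190MB to 2.3GB, and all of them had the same part size of 17,179,870.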
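When the uploading tool is unknown, the part count in the ETag suffix already rules out most part sizes, since a part size only fits if it splits the file into exactly that many parts. The sketch below filters a short candidate list that way. The candidates are assumptions pulled from this thread (5MB, 8MB and 10MB from the examples, 17,179,870 from the console observation), and `consistent_part_sizes` is a made-up name.

```python
MIB = 1024 * 1024

# Assumed candidates, taken from part sizes mentioned in this thread.
CANDIDATE_PART_SIZES = [5 * MIB, 8 * MIB, 10 * MIB, 17179870]


def consistent_part_sizes(file_size, part_count, candidates=CANDIDATE_PART_SIZES):
    """Keep the candidate part sizes that split file_size into part_count parts."""
    matches = []
    for size in candidates:
        parts = -(-file_size // size)  # integer ceiling division
        if parts == part_count:
            matches.append(size)
    return matches


# large.iso from an earlier answer: 669134848 bytes, ETag suffix -80.
print(consistent_part_sizes(669134848, 80))  # [8388608]
```

A surviving candidate is only a hint; the ETag still has to be recomputed with that size and compared, as the earlier answers describe.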

score 0 · Accepted Answer

A working algorithm implemented in Node.js (TypeScript).

/**
 * Generate an S3 ETAG for multipart uploads in Node.js 
 * An implementation of this algorithm: https://stackoverflow.com/a/19896823/492325
 * Author: Richard Willis <willis.rh@gmail.com>
 */
import fs from 'node:fs';
import crypto, { BinaryLike } from 'node:crypto';

const defaultPartSizeInBytes = 5 * 1024 * 1024; // 5MB

function md5(contents: string | BinaryLike): string {
  return crypto.createHash('md5').update(contents).digest('hex');
}

export function getS3Etag(
  filePath: string,
  partSizeInBytes = defaultPartSizeInBytes
): string {
  const { size: fileSizeInBytes } = fs.statSync(filePath);
  let parts = Math.floor(fileSizeInBytes / partSizeInBytes);
  if (fileSizeInBytes % partSizeInBytes > 0) {
    parts += 1;
  }
  const fileDescriptor = fs.openSync(filePath, 'r');
  let totalMd5 = '';

  for (let part = 0; part < parts; part++) {
    const skipBytes = partSizeInBytes * part;
    const totalBytesLeft = fileSizeInBytes - skipBytes;
    const bytesToRead = Math.min(totalBytesLeft, partSizeInBytes);
    const buffer = Buffer.alloc(bytesToRead);
    fs.readSync(fileDescriptor, buffer, 0, bytesToRead, skipBytes);
    totalMd5 += md5(buffer);
  }

  const combinedHash = md5(Buffer.from(totalMd5, 'hex'));
  const etag = `${combinedHash}-${parts}`;
  return etag;
}

I've published this to npm

npm install s3-etag

import { generateETag } from 's3-etag';

const etag = generateETag(absoluteFilePath, partSizeInBytes);

See the project here: https://github.com/badsyntax/s3-etag

score 0 · Accepted Answer

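I have a solution for iOS and macOS that does not use external helpers like dd and xxd. I have only just found it, so I am reporting it as it is, and I plan to improve it at a later stage. For the moment it relies on both Objective-C and Swift code. First of all, create this helper class in Objective-C: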

AWS3MD5Hash.h

#import <Foundation/Foundation.h>

NS_ASSUME_NONNULL_BEGIN

@interface AWS3MD5Hash : NSObject

- (NSData *)dataFromFile:(FILE *)theFile startingOnByte:(UInt64)startByte length:(UInt64)length filePath:(NSString *)path singlePartSize:(NSUInteger)partSizeInMb;

- (NSData *)dataFromBigData:(NSData *)theData startingOnByte:(UInt64)startByte length:(UInt64)length;

- (NSData *)dataFromHexString:(NSString *)sourceString;

@end

NS_ASSUME_NONNULL_END

AWS3MD5Hash.m

#import "AWS3MD5Hash.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 256

@implementation AWS3MD5Hash


- (NSData *)dataFromFile:(FILE *)theFile startingOnByte:(UInt64)startByte length:(UInt64)length filePath:(NSString *)path singlePartSize:(NSUInteger)partSizeInMb {


   char *buffer = malloc(length);


   NSURL *fileURL = [NSURL fileURLWithPath:path];
   NSNumber *fileSizeValue = nil;
   NSError *fileSizeError = nil;
   [fileURL getResourceValue:&fileSizeValue
                           forKey:NSURLFileSizeKey
                            error:&fileSizeError];

   NSInteger __unused result = fseek(theFile,startByte,SEEK_SET);

   if (result != 0) {
      free(buffer);
      return nil;
   }

   NSInteger result2 = fread(buffer, length, 1, theFile);

   NSUInteger difference = fileSizeValue.integerValue - startByte;

   NSData *toReturn;

   if (result2 == 0) {
       toReturn = [NSData dataWithBytes:buffer length:difference];
    } else {
       toReturn = [NSData dataWithBytes:buffer length:result2 * length];
    }

     free(buffer);

     return toReturn;
 }

 - (NSData *)dataFromBigData:(NSData *)theData startingOnByte:  (UInt64)startByte length:(UInt64)length {

   NSUInteger fileSizeValue = theData.length;
   NSData *subData;

   if (startByte + length > fileSizeValue) {
        subData = [theData subdataWithRange:NSMakeRange(startByte, fileSizeValue - startByte)];
    } else {
       subData = [theData subdataWithRange:NSMakeRange(startByte, length)];
    }

        return subData;
    }

- (NSData *)dataFromHexString:(NSString *)string {
    string = [string lowercaseString];
    NSMutableData *data= [NSMutableData new];
    unsigned char whole_byte;
    char byte_chars[3] = {'\0','\0','\0'};
    NSInteger i = 0;
    NSInteger length = string.length;
    while (i < length-1) {
       char c = [string characterAtIndex:i++];
       if (c < '0' || (c > '9' && c < 'a') || c > 'f')
           continue;
       byte_chars[0] = c;
       byte_chars[1] = [string characterAtIndex:i++];
       whole_byte = strtol(byte_chars, NULL, 16);
       [data appendBytes:&whole_byte length:1];
    }

        return data;
}


@end

Now create a plain swift file:

AWS Extensions.swift

import UIKit
import CommonCrypto

extension URL {

func calculateAWSS3MD5Hash(_ numberOfParts: UInt64) -> String? {


    do {

        var fileSize: UInt64!
        var calculatedPartSize: UInt64!

        let attr:NSDictionary? = try FileManager.default.attributesOfItem(atPath: self.path) as NSDictionary
        if let _attr = attr {
            fileSize = _attr.fileSize();
            if numberOfParts != 0 {



                let partSize = Double(fileSize / numberOfParts)

                var partSizeInMegabytes = Double(partSize / (1024.0 * 1024.0))



                partSizeInMegabytes = ceil(partSizeInMegabytes)

                calculatedPartSize = UInt64(partSizeInMegabytes)

                if calculatedPartSize % 2 != 0 {
                    calculatedPartSize += 1
                }

                if numberOfParts == 2 || numberOfParts == 3 { // Very important when there are 2 or 3 parts, in the majority of times
                                                              // the calculatedPartSize is already 8. In the remaining cases we force it.
                    calculatedPartSize = 8
                }


                if mainLogToggling {
                    print("The calculated part size is \(calculatedPartSize!) Megabytes")
                }

            }

        }

        if numberOfParts == 0 {

            let string = self.memoryFriendlyMd5Hash()
            return string

        }




        let hasher = AWS3MD5Hash.init()
        let file = fopen(self.path, "r")
        defer { let result = fclose(file)}


        var index: UInt64 = 0
        var bigString: String! = ""
        var data: Data!

        while autoreleasepool(invoking: {

                if index == (numberOfParts-1) {
                    if mainLogToggling {
                        //print("Siamo all'ultima linea.")
                    }
                }

                data = hasher.data(from: file!, startingOnByte: index * calculatedPartSize * 1024 * 1024, length: calculatedPartSize * 1024 * 1024, filePath: self.path, singlePartSize: UInt(calculatedPartSize))

                bigString = bigString + MD5.get(data: data) + "\n"

                index += 1

                if index == numberOfParts {
                    return false
                }
                return true

        }) {}

        let final = MD5.get(data :hasher.data(fromHexString: bigString)) + "-\(numberOfParts)"

        return final

    } catch {

    }

    return nil
}

   func memoryFriendlyMd5Hash() -> String? {

    let bufferSize = 1024 * 1024

    do {
        // Open file for reading:
        let file = try FileHandle(forReadingFrom: self)
        defer {
            file.closeFile()
        }

        // Create and initialize MD5 context:
        var context = CC_MD5_CTX()
        CC_MD5_Init(&context)

        // Read up to `bufferSize` bytes, until EOF is reached, and update MD5 context:
        while autoreleasepool(invoking: {
            let data = file.readData(ofLength: bufferSize)
            if data.count > 0 {
                data.withUnsafeBytes {
                    _ = CC_MD5_Update(&context, $0, numericCast(data.count))
                }
                return true // Continue
            } else {
                return false // End of file
            }
        }) { }

        // Compute the MD5 digest:
        var digest = Data(count: Int(CC_MD5_DIGEST_LENGTH))
        digest.withUnsafeMutableBytes {
            _ = CC_MD5_Final($0, &context)
        }
        let hexDigest = digest.map { String(format: "%02hhx", $0) }.joined()
        return hexDigest

    } catch {
        print("Cannot open file:", error.localizedDescription)
        return nil
    }
}

} // closes `extension URL`

struct MD5 {

    static func get(data: Data) -> String {
        var digest = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))

        let _ = data.withUnsafeBytes { bytes in
            CC_MD5(bytes, CC_LONG(data.count), &digest)
        }
        var digestHex = ""
        for index in 0..<Int(CC_MD5_DIGEST_LENGTH) {
            digestHex += String(format: "%02x", digest[index])
        }

        return digestHex
    }
    // The following is a memory friendly version
    static func get2(data: Data) -> String {

    var currentIndex = 0
    let bufferSize = 1024 * 1024
    //var digest = [UInt8](repeating: 0, count: Int(CC_MD5_DIGEST_LENGTH))

    // Create and initialize MD5 context:
    var context = CC_MD5_CTX()
    CC_MD5_Init(&context)


    while autoreleasepool(invoking: {
        var subData: Data!
        if (currentIndex + bufferSize) < data.count {
            subData = data.subdata(in: Range.init(NSMakeRange(currentIndex, bufferSize))!)
            currentIndex = currentIndex + bufferSize
        } else {
            subData = data.subdata(in: Range.init(NSMakeRange(currentIndex, data.count - currentIndex))!)
            currentIndex = currentIndex + (data.count - currentIndex)
        }
        if subData.count > 0 {
            subData.withUnsafeBytes {
                _ = CC_MD5_Update(&context, $0, numericCast(subData.count))
            }
            return true
        } else {
            return false
        }

    }) { }

    // Compute the MD5 digest:
    var digest = Data(count: Int(CC_MD5_DIGEST_LENGTH))
    digest.withUnsafeMutableBytes {
        _ = CC_MD5_Final($0, &context)
    }

    var digestHex = ""
    for index in 0..<Int(CC_MD5_DIGEST_LENGTH) {
        digestHex += String(format: "%02x", digest[index])
    }

    return digestHex

}
}

Now add:

#import "AWS3MD5Hash.h"

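to your Objective-C bridging header. You should be OK with this setup.

Example usage

To test this setup, you can call the following method inside the object that is in charge of handling the AWS connections: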

func getMd5HashForFile() {


    let credentialProvider = AWSCognitoCredentialsProvider(regionType: AWSRegionType.USEast2, identityPoolId: "<INSERT_POOL_ID>")
    let configuration = AWSServiceConfiguration(region: AWSRegionType.APSoutheast2, credentialsProvider: credentialProvider)
    configuration?.timeoutIntervalForRequest = 3.0
    configuration?.timeoutIntervalForResource = 3.0

    AWSServiceManager.default().defaultServiceConfiguration = configuration

    AWSS3.register(with: configuration!, forKey: "defaultKey")
    let s3 = AWSS3.s3(forKey: "defaultKey")


    let headObjectRequest = AWSS3HeadObjectRequest()!
    headObjectRequest.bucket = "<NAME_OF_YOUR_BUCKET>"
    headObjectRequest.key = self.latestMapOnServer.key




    let _: AWSTask? = s3.headObject(headObjectRequest).continueOnSuccessWith { (awstask) -> Any? in

        let headObjectOutput: AWSS3HeadObjectOutput? = awstask.result

        var ETag = headObjectOutput?.eTag!
        // Here you should parse the returned Etag and extract the number of parts to provide to the helper function. Etags end with a "-" followed by the number of parts. If you don't see this format, then pass 0 as the number of parts.
        ETag = ETag!.replacingOccurrences(of: "\"", with: "")

        print("headObjectOutput.ETag \(ETag!)")

        let mapOnDiskUrl = self.getMapsDirectory().appendingPathComponent(self.latestMapOnDisk!)

        let hash = mapOnDiskUrl.calculateAWSS3MD5Hash(<Take the number of parts from the ETag returned by the server>)

        if hash == ETag {
            print("They are the same.")
        }

        print ("\(hash!)")

        return nil
    }



}

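If the ETag returned by the server does not end with a "-" followed by the number of parts, just pass 0 to calculateAWSS3MD5Hash. Please comment if you run into any problems. I am working on a Swift-only solution and will update this answer as soon as I finish. Thanks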

score 0 · Accepted Answer

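Regarding the chunk size, I noticed that it seems to depend on the number of parts. According to the AWS documentation, the maximum number of parts is 10000.

So, starting from the default of 8MB and knowing the file size, the chunk size and the number of parts can be calculated as follows: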

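import math
import os

# `fl` is the path of the local file being checked
chunk_size = 8 * 1024 * 1024
flsz = os.path.getsize(fl)

# Double the chunk size until the file fits within 10000 parts
while flsz / chunk_size > 10000:
    chunk_size *= 2

parts = math.ceil(flsz / chunk_size)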

The number of parts must be rounded up.

score -4 · Accepted Answer

No,

To date there is no solution that matches a normal file's ETag, a multipart file's ETag, and the local file's MD5.
