This is my first question here as I'm fairly new to this world! I've spent a few days trying to figure this out for myself, but haven't so far been able to find any useful info.

I'm trying to retrieve a byte range from a file stored in S3, using something like:

S3Key.get_contents_to_file(tempfile, headers={'Range': 'bytes=0-100000'})

The file I'm reading from is a video file, specifically an MXF. When I request a byte range, the tempfile ends up containing more data than I asked for. For example, with one file, I requested 100,000 bytes and got back 100,451.

One thing to note about MXF files is that they legitimately contain 0x0A (ASCII line feed) and 0x0D (ASCII carriage return).

I had a dig around and it appears that any time a 0A byte is present in the file, the retrieved data contains 0D 0A instead of just 0A, so the download appears to return more data than requested.

As an example, the original file contains the hex string:

02 03 00 00 00 00 3B 0A 06 0E 2B 34 01 01 01 05

But the file downloaded from S3 has:

02 03 00 00 00 00 3B 0D 0A 06 0E 2B 34 01 01 01 05

I've tried to debug the code and work my way through the Boto logic, but I'm relatively new at this, so I get lost very easily.

I created this for testing, and it shows the issue:

from boto.s3.connection import S3Connection
from boto.s3.connection import Location
from boto.s3.key import Key
import boto
import os


## AWS credentials
AWS_ACCESS_KEY_ID = 'access key'
AWS_SECRET_ACCESS_KEY = 'secret key'

## Bucket name and path to file
bucketName = 'bucket name'
filePath = 'path/to/file.mxf'

#Local temp file to download to
tempFilePath = 'c:/tmp/tempfile'


## Setup the S3 connection and create a Key to access the file specified
## in filePath
conn = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
bucket = conn.get_bucket(bucketName)
S3Key = Key(bucket)
S3Key.key = filePath

def testRangeGet(bytesToRead=100000): # default read of 100K
    tempfile = open(tempFilePath, 'w')
    rangeString = 'bytes=0-' + str(bytesToRead -1)  #create byte range as string
    rangeDict = {'Range': rangeString} # add this to the dictionary
    S3Key.get_contents_to_file(tempfile, headers=rangeDict) # using Boto
    tempfile.close()
    bytesRead = os.path.getsize(tempFilePath)
    print 'Bytes requested = ' + str(bytesToRead)
    print 'Bytes received = ' + str(bytesRead)
    print 'Additional bytes = ' + str(bytesRead - bytesToRead)

I guess there is something in the Boto code that is looking out for certain ASCII escape characters and modifying them, and I can't find any way to specify to just treat it as a binary file.

Has anyone had a similar problem and can share a way around it?

Thanks

Tim

1 Answer

Open the output file in binary mode. Otherwise, writing to the file will automatically convert each LF to CR/LF.

tempfile = open(tempFilePath, 'wb')

Of course, this is only necessary on Windows. Unix doesn't translate anything, regardless of whether the file is opened as text or binary.
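To see the effect without needing a Windows machine or an S3 bucket, here is a minimal sketch that writes the 16 bytes from the question twice: once in binary mode, and once in a text mode that is forced to translate LF to CR/LF. It is written for Python 3, and the `newline='\r\n'` argument is only a stand-in for what Windows text mode does by default; the helper names are made up for illustration.

```python
import os
import tempfile

# The 16 bytes from the question; note the single 0x0A (LF) in the middle.
payload = bytes([0x02, 0x03, 0x00, 0x00, 0x00, 0x00, 0x3B, 0x0A,
                 0x06, 0x0E, 0x2B, 0x34, 0x01, 0x01, 0x01, 0x05])


def write_binary(path, data):
    """Write the bytes untouched and return the resulting file size."""
    with open(path, 'wb') as f:
        f.write(data)
    return os.path.getsize(path)


def write_text_with_crlf(path, data):
    """Write through a text-mode file that turns every LF into CR/LF.

    newline='\r\n' is an assumption used to mimic Windows text mode on
    any platform; latin-1 maps each byte to one character and back.
    """
    with open(path, 'w', newline='\r\n', encoding='latin-1') as f:
        f.write(data.decode('latin-1'))
    return os.path.getsize(path)


tmpdir = tempfile.mkdtemp()
binary_size = write_binary(os.path.join(tmpdir, 'binary.bin'), payload)
text_size = write_text_with_crlf(os.path.join(tmpdir, 'text.bin'), payload)

print('Bytes in payload      =', len(payload))
print('Written in binary mode =', binary_size)
print('Written in text mode   =', text_size)
```

The text-mode file grows by one byte for every 0x0A in the data, which matches the pattern in the question, where the downloaded file was larger than the requested range.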

You should also be careful when uploading, so that you don't put this kind of corrupted data into S3 in the first place.
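One possible way to catch this kind of corruption on either side is to compare a checksum of the local original with a checksum of the copy that went through S3, always reading in binary mode. The following is only a sketch for Python 3; `file_md5` is a hypothetical helper, not part of Boto, and the sample bytes are made up.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=64 * 1024):
    """Return the hex MD5 of a file, reading it in binary mode in chunks."""
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps reading until f.read() returns b''.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


# Small sample containing both LF (0x0A) and CR (0x0D) bytes.
sample = bytes([0x3B, 0x0A, 0x06, 0x0D, 0x0E])
fd, sample_path = tempfile.mkstemp()
with os.fdopen(fd, 'wb') as f:
    f.write(sample)

checksum = file_md5(sample_path)
os.remove(sample_path)
print('MD5 =', checksum)
```

If the checksum of the file you are about to upload differs from the checksum of the original, or the checksum of a full download differs from both, some step in between opened the file in text mode.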

Answered 2013-05-28T10:01:42.713