
I am using S3 Select to fetch records from a JSON file in S3. When I fetch data from a small JSON file (about 2 MB, roughly 10,000 records), everything works for me.

Here is my query:

import boto3

s3 = boto3.client('s3')  # cache (the bucket name) and key are defined elsewhere

innerStart = 1
innerStop = 100
maximumLimit = 100
query = ("SELECT * FROM s3object r where r.id > " + str(innerStart) +
         " and r.id <= " + str(innerStop) + " limit " + str(maximumLimit))
r = s3.select_object_content(
    Bucket=cache,
    Key=key + '.json',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'JSON': {'Type': 'Lines'}, 'CompressionType': 'NONE'},
    OutputSerialization={'JSON': {}},
)
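
select_object_content returns an event stream rather than the rows directly; as a minimal sketch (this mirrors the standard botocore event-stream interface, not code from the question), the result payload is typically consumed like this:

for event in r['Payload']:
    if 'Records' in event:
        # 'Records' events carry chunks of the matched rows as bytes.
        print(event['Records']['Payload'].decode('utf-8'))
    elif 'End' in event:
        # An 'End' event signals that the request completed successfully.
        break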

But when I try to query records from a large JSON file (about 100 MB, more than 578,496 records), I get the error below. I also tried changing my query to fetch only a single record from the large file, and that did not work either. Does S3 Select have some limit on the number of characters it will scan?

File "./app/main.py", line 118, in retrieve_from_cache_json
    OutputSerialization={'JSON': {
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 357, in _api_call
    return self._make_api_call(operation_name, kwargs)
File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 676, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (OverMaxRecordSize) when calling the SelectObjectContent operation: The character number in one record is more than our max threshold, maxCharsPerRecord: 1,048,576
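
For reference, the threshold in the message is per record, not per file: with InputSerialization {'Type': 'Lines'}, each line of the object is treated as one record, so a single overly long line can trigger this even when most records are small. A minimal local check for oversized lines (records.json is a placeholder path, not a file from the question):

MAX_CHARS_PER_RECORD = 1_048_576  # the per-record limit named in the error

# Scan a local copy of the JSON Lines file and report any line
# that exceeds the per-record character threshold.
with open('records.json', encoding='utf-8') as f:  # placeholder path
    for lineno, line in enumerate(f, start=1):
        if len(line) > MAX_CHARS_PER_RECORD:
            print(f"line {lineno}: {len(line)} characters")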

Sample JSON file:

{
        "id": 1,
        "hostname": "registry.in.",
        "subtype": "A",
        "value": "5.9.139.185",
        "passive_dns_count": "4",
        "count_total": 11,
        "count": 11
    }
    {
        "id": 2,
        "hostname": "registry.ctn.in.",
        "subtype": "A",
        "value": "18.195.87.188",
        "passive_dns_count": "2",
        "count_total": 11,
        "count": 11
    }
    {
        "id": 3,
        "hostname": "registry.in.",
        "subtype": "NS",
        "value": "ns-243.awsdns-30.com.",
        "passive_dns_count": "6",
        "count_total": 11,
        "count": 11
    }
    ...
    ...

1 Answer


I changed my JSON data to CSV, and CSV select worked for me. Here is my query:

innerStart = 0
innerStop = 100
maximumLimit = 100
query = ("SELECT * FROM s3Object r WHERE cast(r.\"id\" as float) > " + str(innerStart) +
         " and cast(r.\"id\" as float) <= " + str(innerStop) + " limit " + str(maximumLimit))
r = s3.select_object_content(
    Bucket=cache,
    Key='filename' + '.csv',
    ExpressionType='SQL',
    Expression=query,
    InputSerialization={'CSV': {'FileHeaderInfo': 'Use'}, 'CompressionType': 'NONE'},
    OutputSerialization={'CSV': {}},
)
for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
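
The answer does not show the JSON-to-CSV conversion itself; below is a minimal sketch of one way to flatten line-delimited JSON records into a CSV with a header row, which is what FileHeaderInfo 'Use' expects. The file names and the assumption that every record is a flat object with the same keys are mine, not from the answer:

import csv
import json

# Convert line-delimited JSON records into a CSV with a header row,
# assuming every record is a flat object with the same keys.
with open('records.json', encoding='utf-8') as src, \
        open('records.csv', 'w', newline='', encoding='utf-8') as dst:
    writer = None
    for line in src:
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if writer is None:
            # Use the first record's keys as the CSV header.
            writer = csv.DictWriter(dst, fieldnames=list(record.keys()))
            writer.writeheader()
        writer.writerow(record)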
answered 2021-06-05T11:04:19.167