
(This question is very similar to Store multiple elements in json files in AWS Athena)

I have a JSON file in an S3 bucket that is structured like this -

[{"key1": value, "key2": value, "key3": {"key4": value, etc}}, {"key1": value....}]

Two questions:

  1. Why is it that Quicksight can normalize this file perfectly when I point it directly at S3 (unless the bucket contains multiple files that don't match, which is why I'm trying Athena), but Athena struggles with it? I know it has something to do with the formatting (each dictionary isn't on its own line; it's a list of dictionaries rather than bare dictionaries, etc.), but it seems unnecessary to modify the original file when another AWS service can parse and flatten it without issue.

  2. I'm using a Python script in Lambda to call the API, and a list of dictionaries is the format the response comes in. Is there a simple way to format the JSON file the way Athena expects? Below is an example of my current script -

import boto3
import json
import requests

s3 = boto3.resource('s3')
response = requests.request(method, url, **kwargs)
data_dict = response.json()
data_json = json.dumps(data_dict['results'])
s3.Bucket('bucket_name').put_object(Key=key, Body=data_json)

Disclaimer: I am fairly new to AWS/coding in general and have encountered many challenges while trying to understand the AWS documentation, so my apologies if this is a simple fix.

1 Answer

Athena and Quicksight have different back ends, which explains the difference in behavior.

The issue with Athena is that it expects each JSON record on its own line (newline-delimited JSON), not wrapped inside a JSON array. I have written Lambdas to "flatten" out JSON pulled off a stream, similar to your problem.

Here is some sample code that should make the data more compatible with Athena (this code is unrun/untested, but I hope it gives you the right idea):

import boto3
import json
import requests

client = boto3.client('s3')
response = requests.request(method, url, **kwargs)
data_dict = response.json()

# Write one JSON object per line (newline-delimited JSON)
with open('/tmp/out.json', 'w') as output:
    for result in data_dict['results']:
        output.write(json.dumps(result) + '\n')

client.upload_file('/tmp/out.json', 'bucket_name', key)

Keep in mind that Athena does not like keys/column names that contain `.`, so if any appear in your data, you may need to massage the data before storing it in S3.
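One way to do that massaging is a small recursive helper that renames offending keys before you serialize. This is just a sketch (the `sanitize_keys` name and the `_` replacement character are my own choices, not anything Athena requires):

```python
import json

def sanitize_keys(obj):
    """Recursively replace '.' in dict keys with '_' so Athena
    accepts the resulting column names."""
    if isinstance(obj, dict):
        return {k.replace('.', '_'): sanitize_keys(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [sanitize_keys(v) for v in obj]
    return obj

record = {"user.name": "alice", "meta": {"created.at": "2019-05-23"}}
print(json.dumps(sanitize_keys(record)))
```

You would call this on each `result` before `json.dumps` in the loop above.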

If your JSON is nested, as your example indicates with key3, you may also want to look into flattening your JSON before storing it in S3 with something like flatten_json. Athena will let you query nested JSON just fine, but some other tools like Quicksight may not handle complex nested columns.
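For reference, the kind of flattening a library like flatten_json performs can be sketched in a few lines of plain Python (this is a minimal illustration, not the library's actual implementation; the `_` separator is an assumption):

```python
def flatten(nested, parent_key='', sep='_'):
    """Collapse nested dicts into a single level, joining key paths with sep."""
    items = {}
    for k, v in nested.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.update(flatten(v, new_key, sep))
        else:
            items[new_key] = v
    return items

print(flatten({"key1": 1, "key3": {"key4": 2}}))
# {'key1': 1, 'key3_key4': 2}
```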

answered 2019-05-23T14:26:14.283