
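I have this JSON stored in a file on S3. It is the output of an AWS Comprehend entities detection job, so I have no control over how the JSON is organized: the AWS job itself uploads it to S3, and I can't modify the structure of this input: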

{"Entities": 
  [
    {"BeginOffset": 1, "EndOffset": 11, "Score": 0.9815415143966675, "Text": "5 start-up", "Type": "QUANTITY"}, {"BeginOffset": 61, "EndOffset": 183, "Score": 0.8883988261222839, "Text": "https://www.smartadserver.com/ac?jump=1&nwid=33&siteid=99773&pgname=other&fmtid=35357&visit=m&tmstp=1568017721&out=nonrich", "Type": "OTHER"}, {"BeginOffset": 212, "EndOffset": 327, "Score": 0.8162660002708435, "Text": "https://www.smartadserver.com/ac?out=nonrich&nwid=33&siteid=99773&pgname=other&fmtid=35357&visit=m&tmstp=1568017721", "Type": "OTHER"}, {"BeginOffset": 337, "EndOffset": 339, "Score": 0.7018660306930542, "Text": "Trump", "Type": "PERSON"}, {"BeginOffset": 364, "EndOffset": 484, "Score": 0.8932908177375793, "Text": "https://www.smartadserver.com/ac?jump=1&nwid=33&siteid=99773&pgname=other&fmtid=247&visit=m&tmstp=1568017721&out=nonrich", "Type": "OTHER"}, {"BeginOffset": 513, "EndOffset": 626, "Score": 0.8157837986946106, "Text": "https://www.smartadserver.com/ac?out=nonrich&nwid=33&siteid=99773&pgname=other&fmtid=247&visit=m&tmstp=1568017721", "Type": "OTHER"}, {"BeginOffset": 636, "EndOffset": 638, "Score": 0.6977631449699402, "Text": "Oprah Winfrey", "Type": "PERSON"}, {"BeginOffset": 963, "EndOffset": 971, "Score": 0.4658013880252838, "Text": "facebook", "Type": "ORGANIZATION"}, {"BeginOffset": 972, "EndOffset": 979, "Score": 0.6886632442474365, "Text": "twitter", "Type": "TITLE"}, {"BeginOffset": 985, "EndOffset": 993, "Score": 0.7970104813575745, "Text": "linkedin", "Type": "ORGANIZATION"}, {"BeginOffset": 994, "EndOffset": 998, "Score": 0.36566048860549927, "Text": "Menu", "Type": "TITLE"}
  ],
  "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"}


{"Entities": 
  [
    {"BeginOffset": 1, "EndOffset": 13, "Score": 0.9995881915092468, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 27, "EndOffset": 69, "Score": 0.8302029371261597, "Text": "Constitution \u00e9conomique\" - African Manager", "Type": "TITLE"}, {"BeginOffset": 94, "EndOffset": 126, "Score": 0.48702114820480347, "Text": ".wpb_animate_when_almost_visible", "Type": "OTHER"}, {"BeginOffset": 290, "EndOffset": 298, "Score": 0.47538018226623535, "Text": "Fran\u00e7ais", "Type": "OTHER"}, {"BeginOffset": 299, "EndOffset": 306, "Score": 0.6746407747268677, "Text": "English", "Type": "OTHER"}, {"BeginOffset": 464, "EndOffset": 476, "Score": 0.9992197155952454, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 515, "EndOffset": 527, "Score": 0.9994662404060364, "Text": "Nabil Karoui", "Type": "PERSON"}, {"BeginOffset": 581, "EndOffset": 596, "Score": 0.6652442812919617, "Text": "African Manager", "Type": "ORGANIZATION"}, {"BeginOffset": 599, "EndOffset": 615, "Score": 0.8012278079986572, "Text": "09/09/2019 08:45", "Type": "DATE"}, {"BeginOffset": 674, "EndOffset": 685, "Score": 0.8724801540374756, "Text": "tunisiennes", "Type": "OTHER"}, {"BeginOffset": 689, "EndOffset": 701, "Score": 0.9975908398628235, "Text": "15 septembre", "Type": "DATE"}, {"BeginOffset": 753, "EndOffset": 781, "Score": 0.9481445550918579, "Text": "certain nombre d\u2019initiatives", "Type": "QUANTITY"}
  ],
  "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"}

//and so on ...

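I need to find every entity with Type = PERSON and Score > 0.7, and retrieve the following data for each: the person and the file.

Today my query expression is: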

select s.Text from s3object[*].Entities[*] s where s.Type= 'PERSON' AND s.Score > 0.7;

This outputs:

[

    {
        "Text": "Trump"
    },
    {
        "Text": "Oprah Winfrey"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },
    {
        "Text": "Nabil Karoui"
    },

]

This part is fine, but I need to associate each "Text" (the person's name) with the file it comes from. So the query output I expect is:

[

    {
        "Text": "Trump",
        "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"
    },
    {
        "Text": "Oprah Winfrey",
        "File": "inputs/stratgies-5-start-up-qui-allient-tech-et-odorat-a634acaa-6549-4c89-93b3-8951ababa032"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },
    {
        "Text": "Nabil Karoui",
        "File": "inputs/african-manager-nabil-karoui-propose-une-constitution-conomique-6c5b3dc2-1929-4cea-b421-5cd04040f2e2"
    },

]

How can I retrieve this? I tried many possibilities using https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html, but none of them worked.


1 Answer


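There is a note on the page you shared:

Note: Amazon S3 Select and Glacier Select queries currently do not support subqueries or joins.

I would set up Athena directly to run more complex queries against S3 (see the example in the official documentation). Another option is to restructure the JSON so that the join can be avoided, for example by duplicating "File" at the "Text" level. Of course, you could also index this JSON in many other tools and formats to make the data searchable/"queryable".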
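To illustrate the restructuring option, here is a minimal Python sketch that copies "File" onto every entity and then applies the same filter as your query. It assumes the Comprehend output has already been downloaded and read into a string of concatenated JSON objects; the helper names (`iter_json_objects`, `flatten_entities`, `people_with_files`) are made up for this example and are not part of any AWS API.

```python
import json


def iter_json_objects(text):
    """Yield each top-level JSON object from a string of concatenated objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def flatten_entities(text):
    """Yield one record per entity, with the parent "File" duplicated onto it."""
    for doc in iter_json_objects(text):
        for entity in doc.get("Entities", []):
            yield {**entity, "File": doc.get("File")}


def people_with_files(text, min_score=0.7):
    """Return Text/File pairs for PERSON entities scoring above min_score."""
    return [
        {"Text": e["Text"], "File": e["File"]}
        for e in flatten_entities(text)
        if e["Type"] == "PERSON" and e["Score"] > min_score
    ]


if __name__ == "__main__":
    # Tiny made-up sample in the same shape as the question's data.
    sample = (
        '{"Entities": [{"Score": 0.98, "Text": "5 start-up", "Type": "QUANTITY"}, '
        '{"Score": 0.70, "Text": "Trump", "Type": "PERSON"}], "File": "inputs/a"}\n'
        '{"Entities": [{"Score": 0.99, "Text": "Nabil Karoui", "Type": "PERSON"}], '
        '"File": "inputs/b"}\n'
    )
    for row in people_with_files(sample, min_score=0.6):
        print(json.dumps(row))
```

If you write the records from `flatten_entities` back to S3 as one JSON object per line, each record carries its own "File", so a query along the lines of `select s.Text, s.File from s3object s where s.Type = 'PERSON' AND s.Score > 0.7` should no longer need a join. I have not run that query against your bucket, so treat it as a starting point.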

Answered 2019-09-11T12:23:31.507