elasticsearch - Extract text from an array field

Question

One of the fields, called "resources", contains the following 2 inner documents.

  {
  "type": "AWS::S3::Object",
  "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
},
{
  "accountId": "934331768510612",
  "type": "AWS::S3::Bucket",
  "ARN": "arn:aws:s3:::sms_vild"
}

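I need to split the ARN field and get its last part, i.e. "reports_201706.schema", preferably using a scripted field.

What I have tried:

1) I checked the field list and found only 2 entries: resources.accountId and resources.type

2) I tried a date-time field, and it works correctly in the scripted field option (expression).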

doc['eventTime'].value

3) But the same approach does not work for other text fields, for example

doc['eventType'].value

I get this error:

"caused_by":{"type":"script_exception","reason":"link error","script_stack":["doc['eventType'].value","^---- HERE"],"script":"doc['eventType'].value","lang":"expression","caused_by":{"type":"illegal_argument_exception","reason":"Fielddata is disabled on text fields by default. Set fielddata=true on [eventType] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory."}}},"status":500}

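This means I would need to change the mapping. Is there any other way to extract text from a nested array inside an object?

Update:

Please visit the sample Kibana here...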

https://search-accountact-phhofxr23bjev4uscghwda4y7m.us-east-1.es.amazonaws.com/_plugin/kibana/

Search for "ebs_attach.png" and then check the resources field. You will see 2 nested array entries like this...

 {
  "type": "AWS::S3::Object",
  "ARN": "arn:aws:s3:::datameetgeo/ebs_attach.png"
},
{
  "accountId": "513469704633",
  "type": "AWS::S3::Bucket",
  "ARN": "arn:aws:s3:::datameetgeo"
}

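I need to split the ARN field and extract the last part, i.e. "ebs_attach.png"

If I can somehow display it as a scripted field, then I can see the bucket name and the file name side by side on the Discover tab.

Update 2

In other words, I am trying to extract the text shown in this image as a new field on the Discover tab.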

score 2 · Accepted Answer

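While you can use a script for this, I strongly recommend extracting this information at index time. I have provided two examples here. They are far from fail-safe (you need to test with different paths, or with the field missing entirely), but they should give you a basis to start from.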

PUT foo/bar/1
{
  "resources": [
    {
      "type": "AWS::S3::Object",
      "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
    },
    {
      "accountId": "934331768510612",
      "type": "AWS::S3::Bucket",
      "ARN": "arn:aws:s3:::sms_vild"
    }
  ]
}

# this is slow!!!
GET foo/_search
{
  "script_fields": {
    "document": {
      "script": {
        "inline": "return params._source.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
      }
    }
  }
}

# Do this at index time by adding a pipeline
PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "script" : {
        "inline": "ctx.filename = ctx.resources.stream().filter(r -> 'AWS::S3::Object'.equals(r.type)).map(r -> r.ARN.substring(r.ARN.lastIndexOf('/') + 1)).findFirst().orElse('NONE')"
      }
    }
  ]
}

# Store the document, specify the pipeline
PUT foo/bar/1?pipeline=my-pipeline-id
{
  "resources": [
    {
      "type": "AWS::S3::Object",
      "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema"
    },
    {
      "accountId": "934331768510612",
      "type": "AWS::S3::Bucket",
      "ARN": "arn:aws:s3:::sms_vild"
    }
  ]
}

# Let's check the filename field of the indexed document by getting it
GET foo/bar/1

# We can even search for this file now
GET foo/_search
{
  "query": {
    "match": {
      "filename": "reports_201706.schema"
    }
  }
}
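
Since the scripts above are described as far from fail-safe, here is a minimal Python sketch of the same extraction logic that can be used to try out edge cases offline before putting them into a Painless script. It is not part of the original answer: the function name `extract_filename` and the constant `S3_OBJECT_TYPE` are made up for illustration, and it assumes each entry in "resources" is a dict with optional "type" and "ARN" keys. One deliberate difference: where the Painless script would fail on a missing or null ARN, this sketch skips that entry and moves on.

```python
from typing import Any, Mapping, Optional, Sequence

# Hypothetical constant; the value is the resource type used in the scripts above.
S3_OBJECT_TYPE = "AWS::S3::Object"


def extract_filename(
    resources: Optional[Sequence[Any]], default: str = "NONE"
) -> str:
    """Return the last '/'-separated segment of the first usable S3 object ARN.

    Falls back to `default` when `resources` is missing or empty, or when no
    entry has the S3 object type together with a string ARN.
    """
    for resource in resources or []:
        # Skip entries that are not objects at all.
        if not isinstance(resource, Mapping):
            continue
        # Same filter as 'AWS::S3::Object'.equals(r.type) in the script.
        if resource.get("type") != S3_OBJECT_TYPE:
            continue
        arn = resource.get("ARN")
        # Assumption: a missing or non-string ARN is skipped rather than raising.
        if not isinstance(arn, str):
            continue
        # rpartition yields the whole string as the last element when there is
        # no '/', which matches substring(lastIndexOf('/') + 1).
        return arn.rpartition("/")[2]
    return default


if __name__ == "__main__":
    resources = [
        {
            "type": "AWS::S3::Object",
            "ARN": "arn:aws:s3:::sms_vild/servers_backup/db_1246/db/reports_201706.schema",
        },
        {
            "accountId": "934331768510612",
            "type": "AWS::S3::Bucket",
            "ARN": "arn:aws:s3:::sms_vild",
        },
    ]
    print(extract_filename(resources))  # reports_201706.schema
```

Note that an ARN ending in "/" (for example a folder-style key) yields an empty string here, just as it would in the script, so you may want to decide how to treat that case before indexing.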

score 0 · Answer

Note: this assumes "resources" is an array

NSArray *array_ARN_Values = [resources valueForKey:@"ARN"];

Hope it works for you!!!
