logstash - Logstash: import a comma-separated string from MySQL into Elasticsearch as an array

Question

I am trying to insert a comma-separated string (produced by GROUP_CONCAT) into Elasticsearch as an array data type. I use the JDBC input, and the SQL query returns the following:

+---------+-----------+------------+--------------------------+-------------+---------------------+---------+------------+----------+---------------------+-------------+---------+----------------------------------------+
| network | post_dbid | host_dbid  | host_netid               | post_netid  | published           | n_likes | n_comments | language | indexed             | n_harvested | country | vrt                                    |
+---------+-----------+------------+--------------------------+-------------+---------------------+---------+------------+----------+---------------------+-------------+---------+----------------------------------------+
| xxx     | 2_xxx     | 60480_xxx  | xxxxxxxxxxxxxxxxxxxxxxxx | xxxxxxxxxxx | 2017-12-28 08:11:58 | 5       | 0          | en       | 2018-05-30 00:00:00 | 0           | ID      | Fitness,Well-being                     |
| xxx     | 5_xxx     | 98458_xxx  | xxxxxxxxxxxxxxxxxxxxxxxx | xxxxxxxxxxx | 2016-09-01 11:59:14 | 2275    | 242        | ar       | 2018-05-30 00:00:00 | 0           | SA      | SmartPhones_Gadgets                    |
| xxx     | 15_xxx    | 50884_xxx  | xxxxxxxxxxxxxxxxxxxxxxxx | xxxxxxxxxxx | 2018-04-23 16:36:10 | 0       | 0          | en       | 2018-05-30 00:00:00 | 0           | EG      | Fashion_Beauty                         |
| xxx     | 21_xxx    | 64118_xxx  | xxxxxxxxxxxxxxxxxxxxxxxx | xxxxxxxxxxx | 2015-07-01 22:50:54 | 295     | 8          | pt       | 2018-05-30 00:00:00 | 0           | BR      | Nutrition                              |
| xxx     | 24_xxx    | 9767_xxx   | xxxxxxxxxxxxxxxxxxxxxxxx | xxxxxxxxxxx | 2017-05-30 02:35:29 | 10      | 1          | en       | 2018-06-18 15:32:57 | 0           | US      | Health                                 |
| xxx     | 87_xxx    | 44473_xxx  | xxxxxxxxxxxxxxxxxxxxxxxx | xxxxxxxxxxx | 2017-01-08 23:02:52 | 7       | 0          | en       | 2018-05-30 00:00:00 | 0           | US      | Beverages                              |
| xxx     | 99_xxx    | 120198_xxx | xxxxxxxxxxxxxxxxxxxxxxxx | xxxxxxxxxxx | 2018-02-17 02:57:58 | 8       | 0          | en       | 2018-05-30 00:00:00 | 0           | US      | Food                                   |
| xxx     | 126_xxx   | 50258_xxx  | xxxxxxxxxxxxxxxxxxxxxxxx | xxxxxxxxxxx | 2018-03-22 09:16:25 | 1       | 0          | en       | 2018-05-30 00:00:00 | 0           | IN      | Health                                 |
+---------+-----------+------------+--------------------------+-------------+---------------------+---------+------------+----------+---------------------+-------------+---------+----------------------------------------+

I used the split option of the mutate plugin:

filter {
  mutate {
     split => { "vrt" => "," }
  }
}

However, the field is still inserted as a comma-separated string:

GET xxx/_search
{
  "query": {
    "terms": {
      "_id": ["2_xxx"] 
    }
  }
}

Response:

{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "xxx",
        "_type": "doc",
        "_id": "2_xxx",
        "_score": 1,
        "_source": {
          "post_dbid": "2_xxx",
          "host_dbid": "60480_xxx",
          "host_netid": "xxxxxxxxxxxxxxxxxxxxxxxx",
          "n_likes": 5,
          "n_comments": 0,
          "country": "ID",
          "network": "xxx",
          "indexed": "2018-05-30T00:00:00.000Z",
          "n_harvested": 0,
          "vrt": "Fitness,Well-being",
          "@version": "1",
          "post_netid": "xxxxxxxxxxx",
          "@timestamp": "2018-06-27T15:47:24.370Z",
          "language": "en",
          "published": "2017-12-28T08:11:58.000Z"
        }
      }
    ]
  }
}

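My end goal is to insert vrt as an array field and use Kibana to build visualizations. For example, I want to create a counter in Kibana that counts how many documents have "Fitness" in the vrt field.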

ELK version: 6.2.4

score 2 · Accepted Answer

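You can use the ruby filter. Here is how I did it: I wrote a Ruby method that splits the comma-separated string, trims whitespace, rejects empty elements, and removes duplicates. You can then apply that method to every comma-separated field, like this: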

filter {
  ruby {
    code => "
      # Split the supplied string on commas, trim whitespace, drop empty
      # elements and duplicates, and return the result as an array.
      def mapStringToArray(strFieldValue)
        # Return an empty array if the field is missing (nil)
        return [] if strFieldValue.nil?

        strFieldValue.split(',').map(&:strip).reject(&:empty?).uniq
      end

      vrtArr = mapStringToArray(event.get('vrt'))
      # Only overwrite the field when at least one element remains
      if vrtArr.length > 0
        event.set('vrt', vrtArr)
      end
    "
  }
}
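
To check the splitting logic outside Logstash, the helper from the filter above can be run as plain Ruby. This is only a minimal sketch: the snake_case name map_string_to_array is a hypothetical stand-in for mapStringToArray, the sample strings other than "Fitness,Well-being" are made up for illustration, and no Logstash event object is involved.

```ruby
# Standalone version of the helper; map_string_to_array is a hypothetical
# name used only for this sketch (it mirrors mapStringToArray above).
def map_string_to_array(str_field_value)
  # A missing field (nil) maps to an empty array
  return [] if str_field_value.nil?

  # Split on commas, trim whitespace, drop empty elements and duplicates
  str_field_value.split(',').map(&:strip).reject(&:empty?).uniq
end

# Value taken from the first row of the query output in the question
p map_string_to_array('Fitness,Well-being')

# Made-up input with stray spaces, an empty element and a duplicate
p map_string_to_array(' Health, ,Health,Food ')

# A NULL column arrives as nil
p map_string_to_array(nil)
```

In the filter itself, the empty-array case is why event.set is guarded by the length check: a nil or blank vrt value leaves the event unchanged rather than writing an empty array.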
