regex - Using ReplaceTextWithMapping with multiple columns in the mapping file

Question

I need some clarification on how to use ReplaceTextWithMapping in NiFi for my specific case. My input file looks like this:

{"field1" : "A",
"field2" : "A",
"field3": "A"
}

The mapping file, in turn, looks like this:

 Header1;Header2;Header3
 A;some text;2

My expected result is:

   {"field1" : "some text",
    "field2": "A",
    "field3": "A2"
    }

The regular expression is simply:

[A-Z0-9]+

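It matches the field key in the mapping file (we expect uppercase letters, or uppercase letters plus digits), but I am not sure how you decide which value (from column 2 or from column 3) gets assigned to the input value. In addition, my field2 should not change: it needs to keep the value it had in the input, with no mapping involved. At the moment, I get something like this: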

  {"field1" : "some text A2",
    "field2": "some text A2",
    "field3": "some text A2"
    }

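I guess my main question is: can you map the same value in the input file to different values taken from different columns of the mapping file?

Thanks

EDIT: I am using ReplaceTextWithMapping, an out-of-the-box processor in Apache NiFi (v. 0.5.1). In my data flow I end up with a JSON file, to which I need to apply some mappings from an external file that I would like to load into memory (rather than, for example, parsing it with ExtractText).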

score 2 · Accepted Answer

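Foreword

It looks like you are working with a JSON string. Such a string is easier to handle with a JSON parsing engine, because the JSON structure allows for difficult edge cases that make parsing with a regular expression hard. That said, I am sure you have your reasons, and I am not the regex police.

Description

To do a replacement like this, it is easier to capture both the substrings you want to keep and the substrings you want to replace.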

(\{"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)("[,\r\n]+\})

Replace with: $1SomeText$3$4$5A2$7


Note: I recommend using the following flags with this expression: case-insensitive, and dot matches all characters including newlines.

Example

Live Demo

This example shows how the regular expression matches your source text: https://regex101.com/r/vM1qE2/1

Source text

{"field1" : "A",
"field2" : "A",
"field3": "A"
}

After replacement

{"field1" : "SomeText",
"field2" : "A",
"field3": "A2"
}
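
For readers who want to try the replacement outside a regex tester, here is a minimal sketch in Python (my choice of language, not something the answer prescribes). It uses the same expression with the case-insensitive flag; Python writes the back-references as \g<n> instead of $n, and the variable names are purely illustrative.

```python
import re

# The expression from the answer: odd groups are kept, even groups are the values.
PATTERN = re.compile(
    r'(\{"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+"[a-z0-9]+"\s*:\s*")([a-z0-9]+)'
    r'("[,\r\n]+\})',
    re.IGNORECASE,  # needed because the values ("A") are uppercase
)

# Python spelling of $1SomeText$3$4$5A2$7; \g<5> avoids ambiguity before "A2".
REPLACEMENT = r'\g<1>SomeText\g<3>\g<4>\g<5>A2\g<7>'

source = '{"field1" : "A",\n"field2" : "A",\n"field3": "A"\n}'
replaced = PATTERN.sub(REPLACEMENT, source)
print(replaced)
```

Note that the replacement values are hard-coded here, just as in the answer; nothing is read from the mapping file.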

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  (                        group and capture to \1:
----------------------------------------------------------------------
    \{                       '{'
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    :                        ':'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
  )                        end of \1
----------------------------------------------------------------------
  (                        group and capture to \2:
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \2
----------------------------------------------------------------------
  (                        group and capture to \3:
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [,\r\n]+                 any character of: ',', '\r' (carriage
                             return), '\n' (newline) (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    :                        ':'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
  )                        end of \3
----------------------------------------------------------------------
  (                        group and capture to \4:
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \4
----------------------------------------------------------------------
  (                        group and capture to \5:
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [,\r\n]+                 any character of: ',', '\r' (carriage
                             return), '\n' (newline) (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    :                        ':'
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
  )                        end of \5
----------------------------------------------------------------------
  (                        group and capture to \6:
----------------------------------------------------------------------
    [a-z0-9]+                any character of: 'a' to 'z', '0' to '9'
                             (1 or more times (matching the most
                             amount possible))
----------------------------------------------------------------------
  )                        end of \6
----------------------------------------------------------------------
  (                        group and capture to \7:
----------------------------------------------------------------------
    "                        '"'
----------------------------------------------------------------------
    [,\r\n]+                 any character of: ',', '\r' (carriage
                             return), '\n' (newline) (1 or more times
                             (matching the most amount possible))
----------------------------------------------------------------------
    \}                       '}'
----------------------------------------------------------------------
  )                        end of \7

score 0 · Accepted Answer

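So I dug into ReplaceTextWithMapping to try to get it to solve your use case, but I just don't think it is powerful enough to do what you want. Currently it is designed almost entirely to match a simple regular expression and map one group of non-whitespace characters to another group of characters (which may contain whitespace and back-references).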
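
To illustrate why that design cannot give the same input value different outputs, here is a rough Python sketch of the "one key, one replacement" behavior described above. It is not NiFi's implementation; the function name and the in-memory dictionary are assumptions made for the example.

```python
import re

def replace_with_mapping(text, pattern, mapping):
    """Replace every regex match that is a key in `mapping`; leave other matches alone."""
    return re.sub(pattern, lambda m: mapping.get(m.group(0), m.group(0)), text)

# A single key can only carry a single replacement value.
mapping = {"A": "some text"}
source = '{"field1" : "A",\n"field2" : "A",\n"field3": "A"\n}'

result = replace_with_mapping(source, r"[A-Z0-9]+", mapping)
print(result)
```

Every "A" is matched by the expression and looked up under the same key, so all three fields receive the same text. The lookup has no knowledge of which JSON field the match came from.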

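Viewed as plain text, your use case is changing the value of one capture group based on the value of another capture group and a mapping file. Viewed as JSON, your use case is much simpler: you want to change the value of a key/value pair based on the key and a mapping file. Side note: if you did not need the mapping file, I believe there will be a new JSON-to-JSON processor in 0.7.0 [1] that would work.

As for finding a solution, both ways of looking at the problem are valid. ReplaceTextWithMapping could certainly be extended to support more advanced use cases, but that might make it overly complicated (although, since its scope is not clearly defined, it may already be more confusing than it should be). A new processor along the lines of "ReplaceJsonWithMapping" could also be added, but its scope and purpose would need to be clearly defined.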

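Also, for a more immediate solution, there is always the option of using the ExecuteScript processor. Here [2] is a link to a blog post (written by the creator of ExecuteScript) that outlines how to write a basic JSON-to-JSON processor. More logic would need to be added to read the mapping file.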
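
As a rough idea of what that extra logic might look like, here is a sketch in plain Python, run outside NiFi; wiring it into ExecuteScript is not shown. It assumes a semicolon-delimited mapping file with a header row, as in the question. The FIELD_RULES table is hypothetical: it encodes one reading of the expected output, where field1 takes the Header2 value and field3 gets the Header3 value appended to its original value.

```python
import csv
import io
import json

def load_mapping(lines):
    """Parse a semicolon-delimited mapping with a header row, keyed by its first column."""
    cleaned = (line.strip() for line in lines if line.strip())
    reader = csv.DictReader(cleaned, delimiter=";")
    key_column = reader.fieldnames[0]
    return {row[key_column]: row for row in reader}

# Hypothetical per-field rules (an assumption, not part of the mapping file):
# each rule gets the original value and the matching mapping row.
FIELD_RULES = {
    "field1": lambda value, row: row["Header2"],
    "field3": lambda value, row: value + row["Header3"],
}

def apply_mapping(record, mapping, rules):
    """Return a copy of `record` with the ruled fields rewritten; others stay untouched."""
    result = dict(record)
    for field, rule in rules.items():
        row = mapping.get(record.get(field))
        if row is not None:
            result[field] = rule(record[field], row)
    return result

# In a real flow the mapping would come from a file; StringIO stands in for it here.
mapping = load_mapping(io.StringIO("Header1;Header2;Header3\nA;some text;2\n"))
record = json.loads('{"field1": "A", "field2": "A", "field3": "A"}')
mapped = apply_mapping(record, mapping, FIELD_RULES)
print(json.dumps(mapped))
```

Because the rules are keyed by JSON field name, field2 is simply left out of the table and keeps its input value, which is the part a plain text replacement cannot express.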

[1] https://issues.apache.org/jira/browse/NIFI-361
[2] http://funnifi.blogspot.com/2016/02/executescript-json-to-json-conversion.html
