regex - Groovy script to replace \r\n, \n and \t with " "

Question

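I am using Apache NiFi to build my data flow, and the data I am currently working with consists of delimited values. I want to use ExecuteScript, so I put together a simple Groovy script that should do the following:

1) Replace the current delimiter with a pipe (|)

2) Replace \r\n and \t with " "

The reason for this script is to do some data cleaning and wrangling on a dataset that shows the following problems:

a) Text (usually long) runs on or breaks across lines with \r\n. This may happen before a full stop, but not consistently.

b) Empty lines (the script does not handle these yet)

1) was easy to accomplish, but the code for 2) does not seem to remove the tabs and carriage returns, and I don't know why. Here is the code: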

import org.apache.nifi.processor.io.StreamCallback

import java.nio.charset.StandardCharsets

def flowFile = session.get()
if(!flowFile) return

flowFile = session.write(flowFile, {inputStream, outputStream ->
    inputStream.eachLine { line ->
        def a = line.replaceAll('\t', ' ').replaceAll('\r\n', ' ').replaceAll('¦', '|')
        outputStream.write("${a}\n".toString().getBytes(StandardCharsets.UTF_8))
    }
} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)

Thanks for your help.

score 2 · Accepted Answer

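When you iterate over the lines with eachLine, every \r and \n has already been removed, because eachLine splits on them and then hands you the resulting lines one by one. If you want to remove the line breaks, either don't use eachLine, or simply omit the \n from the write() call.

As for '\t', are you sure those really are '\t' characters?

Besides that, you shouldn't use replaceAll() when you are not using a regular expression. Use replace() instead.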
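
To make the points above concrete, here is a minimal standalone sketch that runs outside NiFi. The sample string and the variable names (raw, seen, cleaned) are made up for illustration; only eachLine, replace and replaceAll come from the discussion.

```groovy
import java.nio.charset.StandardCharsets

// Hypothetical sample: one record broken across two lines, with a tab inside.
def raw = "id¦long text\tthat wraps\r\nonto a second line\r\n"
def input = new ByteArrayInputStream(raw.getBytes(StandardCharsets.UTF_8))

// Collect what eachLine actually passes to the closure.
def seen = []
input.eachLine('UTF-8') { line -> seen << line }

// The line terminators are already gone, so replacing '\r\n' inside
// the closure can never match anything.
assert seen == ["id¦long text\tthat wraps", "onto a second line"]

// Joining with a space (instead of writing "\n") is what turns the
// line breaks into " ".
def cleaned = seen.collect { it.replace('\t', ' ').replace('¦', '|') }.join(' ')
assert cleaned == 'id|long text that wraps onto a second line'

// replace() is literal, replaceAll() treats its first argument as a regex.
assert 'a.b'.replace('.', '-') == 'a-b'
assert 'a.b'.replaceAll('.', '-') == '---'
```

If the tabs still survive a literal replace('\t', ' '), that suggests the input does not contain real tab characters (for example a literal backslash followed by t), which is what the question above is hinting at.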

score 0 · Accepted Answer

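I have finished this script, but it seems to remove the LF at the end of every line and writes the content out as a single line. I was wondering whether you can spot anything obviously wrong in the code. I want a \n only at the end of lines that end with the pattern: |digit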

import org.apache.nifi.processor.io.StreamCallback
import java.nio.charset.StandardCharsets

def flowFile = session.get()
if (!flowFile) return

flowFile = session.write(flowFile, { inputStream, outputStream ->
    inputStream.eachLine { line ->
        def a = line.replace('\t', ' ').replace('¦', '|')
        if (${a}.endWith('\\d$'))
            outputStream.write("${a}\n".toString().getBytes(StandardCharsets.UTF_8))
        else {
            outputStream.write("${a}".toString().getBytes(StandardCharsets.UTF_8))
        }
    }

} as StreamCallback)

session.transfer(flowFile, REL_SUCCESS)
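
One possible way to express the "newline only after |digit" rule is sketched below as a standalone script, not as a confirmed fix. Three things in the condition above look suspect: ${a} only has its interpolation meaning inside a double-quoted string, the String method is endsWith rather than endWith, and endsWith compares a literal suffix rather than a regular expression. The helper name rejoin and the sample data are hypothetical, and writing a space for non-terminal lines is an assumption taken from the goal stated in the question (the code above writes nothing in that branch).

```groovy
import java.nio.charset.StandardCharsets

// Hypothetical helper: the same per-line logic as the closure body above,
// written against plain streams so it can run without NiFi.
def rejoin(InputStream inputStream, OutputStream outputStream) {
    inputStream.eachLine('UTF-8') { line ->
        def a = line.replace('\t', ' ').replace('¦', '|')
        // The find operator (=~) applies a regex; the pattern means
        // "a pipe followed by a single digit at the end of the line".
        def terminator = (a =~ /\|\d$/).find() ? '\n' : ' '
        outputStream.write((a + terminator).getBytes(StandardCharsets.UTF_8))
    }
}

// Made-up sample: the first record is split across two physical lines.
def sample = "1¦first part\r\nsecond\tpart¦7\r\n2¦short¦3\r\n"
def out = new ByteArrayOutputStream()
rejoin(new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8)), out)
def result = out.toString('UTF-8')

assert result == "1|first part second part|7\n2|short|3\n"
```

Inside ExecuteScript, the body of rejoin would go into the session.write callback in place of the current eachLine block.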
