groovy - Find and replace special characters in a file

Question

I am trying to find and replace some special characters in a file encoded in ISO-8859-1, then write the result to a new file encoded in UTF-8:

package inv

class MigrationScript {

    static main(args) {
        new MigrationScript().doStuff();
    }

    void doStuff() {
        def dumpfile = "path to input file";
        def newfileP = "path to output file"

        def file = new File(dumpfile)
        def newfile = new File(newfileP)

        def x = [
            "þ":"ş",
            "ý":"ı",
            "Þ":"Ş",
            "ð":"ğ",
            "Ý":"İ",
            "Ð":"Ğ"
        ]

        def r = file.newReader("ISO-8859-1")
        def w = newfile.newWriter("UTF-8")

        r.eachLine{
            line ->

                x.each {
                    key, value ->
                    if(line.find(key)) println "found a special char!" 
                    line = line.replaceAll(key, value);
                }

                w << line + System.lineSeparator();
        }

        w.close()
    }
}

The content of my input file is:

"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"

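The problem is that my code never finds the specified characters. The Groovy script file itself is encoded in UTF-8. I suspect that may be the cause, but I can't encode it in ISO-8859-1, because then I couldn't write "Ş", "Ğ", etc. in it.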

score 1 · Accepted Answer

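I took your code sample and ran it against an input file encoded with the ISO-8859-1 charset, and it worked as expected. Can you double-check that your input file is really encoded in ISO-8859-1? Here is what I did: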

I took the file content from your question and saved it to /tmp/test.txt in SublimeText using Save -> Save with Encoding -> Western (ISO 8859-1)

I checked the file encoding with the following Linux command:

file -i /tmp/test.txt
/tmp/test.txt: text/plain; charset=iso-8859-1
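Where the file command is not available, a similar check can be approximated in Groovy itself. The following is only a minimal sketch: isValidUtf8 is a hypothetical helper name, and the sample text is written with Unicode escapes so the result does not depend on how the script file is saved. It relies on the fact that the ISO-8859-1 bytes for characters such as þ (0xFE) are never valid in UTF-8. Note that pure ASCII content passes this check under either encoding.

```groovy
import java.nio.ByteBuffer
import java.nio.charset.CharacterCodingException
import java.nio.charset.CodingErrorAction
import java.nio.charset.StandardCharsets

// Hypothetical helper: true when the bytes decode as strict UTF-8.
boolean isValidUtf8(byte[] bytes) {
    def decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
    try {
        decoder.decode(ByteBuffer.wrap(bytes))
        return true
    } catch (CharacterCodingException ignored) {
        return false
    }
}

// Sample text containing U+00FE and U+00FD, the first two characters from the question.
def sample = '"\u00FE": "\u00FD"'
byte[] latin1Bytes = sample.getBytes(StandardCharsets.ISO_8859_1)
byte[] utf8Bytes = sample.getBytes(StandardCharsets.UTF_8)

println isValidUtf8(latin1Bytes)  // false: 0xFE cannot appear in UTF-8
println isValidUtf8(utf8Bytes)    // true
```

For a real file, the bytes could be obtained with new File(path).bytes. A false result means the file is not UTF-8, which is consistent with ISO-8859-1 but does not prove it.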

I set the dumpfile variable to /tmp/test.txt and the newfileP variable to /tmp/test_2.txt

I ran your code and saw the following in the console:

found a special char!
found a special char!
found a special char!
found a special char!
found a special char!
found a special char!

I checked the encoding of the Groovy file in IntelliJ IDEA: it was UTF-8

I checked the encoding of the output file:

file -i /tmp/test_2.txt
/tmp/test_2.txt: text/plain; charset=utf-8

I checked the content of the output file:

cat /tmp/test_2.txt 
"ş": "ı": "Ş":" "ğ":" "İ":" "Ğ":"
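For comparison, the same conversion can be sketched more compactly. This is not the asker's code: fixLine and convert are hypothetical names, the mapping is written with Unicode escapes so that it does not depend on the script file's encoding, and String.replace (literal matching) is swapped in for replaceAll (regex matching). The temporary files exist only to make the sketch self-contained.

```groovy
// Same mapping as in the question, written as Unicode escapes.
def replacements = [
    '\u00FE': '\u015F',
    '\u00FD': '\u0131',
    '\u00DE': '\u015E',
    '\u00F0': '\u011F',
    '\u00DD': '\u0130',
    '\u00D0': '\u011E'
]

// Hypothetical helper: applies every literal replacement to one line.
String fixLine(String line, Map<String, String> mapping) {
    String result = line
    mapping.each { from, to -> result = result.replace(from, to) }
    return result
}

// Hypothetical helper: reads ISO-8859-1, writes UTF-8, and closes both streams.
void convert(File input, File output, Map<String, String> mapping) {
    input.withReader('ISO-8859-1') { reader ->
        output.withWriter('UTF-8') { writer ->
            reader.eachLine { line ->
                writer << fixLine(line, mapping) << System.lineSeparator()
            }
        }
    }
}

// Demo with temporary files standing in for the real input and output paths.
def input = File.createTempFile('latin1', '.txt')
def output = File.createTempFile('utf8', '.txt')
input.setText('"\u00FE": "\u00FD"', 'ISO-8859-1')

convert(input, output, replacements)
println output.getText('UTF-8').trim()
```

Using withReader and withWriter means both files are closed even if a replacement throws, which the explicit w.close() in the question does not guarantee.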

I don't think it matters, but I used the latest Groovy, 2.4.13

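My guess is that your input file is not encoded the way you expect. Please double-check its encoding: when I saved the same content as UTF-8 instead, your program did not work as expected and I saw no found a special char! entries in the console. When I display the content of the ISO-8859-1 file, I see something like this: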

cat /tmp/test.txt 
"�": "�": "�":" "�":" "�":" "�":"%

If I save the same content as UTF-8, I see readable content:

cat /tmp/test.txt
"þ": "ý": "Þ":" "ð":" "Ý":" "Ð":"%
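Why a UTF-8 input defeats the search can be reproduced in memory. The sketch below is an illustration only (the variable names are made up for it): it takes the UTF-8 bytes of þ and decodes them as ISO-8859-1, which is what newReader("ISO-8859-1") does when the file is actually UTF-8.

```groovy
import java.nio.charset.StandardCharsets

// U+00FE (lowercase thorn), the first key in the question's replacement map.
def thorn = '\u00FE'

// In a UTF-8 file this character is stored as two bytes: 0xC3 0xBE.
byte[] utf8Bytes = thorn.getBytes(StandardCharsets.UTF_8)

// Decoding those two bytes as ISO-8859-1 produces two characters, U+00C3 and U+00BE.
def misread = new String(utf8Bytes, StandardCharsets.ISO_8859_1)
println misread.length()         // 2
println misread.contains(thorn)  // false: the key is never found

// When the file really is ISO-8859-1, the single byte 0xFE decodes back to U+00FE.
byte[] latin1Bytes = thorn.getBytes(StandardCharsets.ISO_8859_1)
def correct = new String(latin1Bytes, StandardCharsets.ISO_8859_1)
println correct.contains(thorn)  // true
```

This matches the behaviour described above: with a UTF-8 input, each special character arrives as a two-character sequence, so none of the map keys match.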

Hope this helps you find the source of the problem.
