
I have a problem that I can't solve. Can anyone help me?

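OK, so I'm trying to normalize my text with a program that removes repeated spaces, prints the other characters from the original file, and adds spaces as well as start and end symbols.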

After the conversion, once the txt file was written and I opened it, I saw something like this:

numa situaã § ã £ o de emergãªncia mã © dica

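As you can see, there are some strange characters that I don't want, maybe because of the encoding? This is text in my language, Portuguese.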

Here is my code. How can I fix it?

public static void main(String[] args) throws IOException {

        Charset encoding = Charset.defaultCharset();

        InputStream in = new FileInputStream(new File("data.txt"));
        Reader reader = new InputStreamReader(in, encoding);
        Reader buffer = new BufferedReader(reader);
        StringBuilder normalizedLanguage = new StringBuilder("<");
        int r;
        while ((r = buffer.read()) != -1) {
            char ch = (char) r;




            boolean newline = false;
            boolean hasLetterBefore = false;
            boolean hasLetterAfter = false;
            char symbol = '-';
            int lines = 0;

            if (newline)
            {
                normalizedLanguage.append("\n<");
            }


            if (ch == '\r' || ch == '\n' )
            {
                lines++;
                normalizedLanguage.append(">");
                newline = true;
                hasLetterBefore = false;


            }
            else if (Character.isLetterOrDigit(ch))
            {
                if (hasLetterBefore == true)
                {
                    normalizedLanguage.append(Character.toString(symbol) + Character.toString(Character.toLowerCase(ch)));
                }else{
                    normalizedLanguage.append(Character.toString(Character.toLowerCase(ch)));
                }


                newline = false;
                hasLetterBefore = true;
            }
            else if (ch == ' ')
            {
                normalizedLanguage.append(Character.toString(ch));
                newline = false;
                hasLetterBefore = false;
            }
            else if (ch == '\t')
            {
                System.out.println("Tab detected: " + ch);
                newline = false;
                hasLetterBefore = false;
            }
            else
            {
                // Symbols, among other things
                if (!hasLetterBefore)
                {
                    normalizedLanguage.append(" " + Character.toString(ch) + " ");
                }
                else
                {
                    symbol = ch;
                }
                newline = false;

            }


        }

        String normalizedLanguageString = normalizedLanguage.toString().trim().replaceAll(" +", " ");

        PrintWriter out = new PrintWriter("data_after.txt");

        out.println(normalizedLanguageString);
        out.close();

        buffer.close();
        reader.close();
        in.close();

    }

Thank you very much ;)


1 Answer


Solved it by using a different charset encoding :)

Change this line:

Charset encoding = Charset.defaultCharset();

to:

Charset encoding = Charset.forName("UTF8");
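Note that the change above only fixes the reading side. `new PrintWriter("data_after.txt")` still writes with the platform default charset, so it may be worth setting the encoding on both ends. Below is a minimal sketch of that idea, assuming `data.txt` really is UTF-8. The class name `Utf8Copy` and the helper `lowercaseCopy` are made up for illustration, and the loop body only lowercases each character as a stand-in for the real normalization logic.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class Utf8Copy {

    // Hypothetical helper: reads `source` as UTF-8, lowercases every
    // character, and writes the result to `target`, also as UTF-8.
    static void lowercaseCopy(Path source, Path target) throws IOException {
        // try-with-resources closes both streams, even if an exception is thrown.
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            int r;
            while ((r = reader.read()) != -1) {
                // Placeholder for the normalization step from the question.
                writer.write(Character.toLowerCase((char) r));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // File names taken from the question.
        lowercaseCopy(Paths.get("data.txt"), Paths.get("data_after.txt"));
    }
}
```

`StandardCharsets.UTF_8` (Java 7+) avoids the name lookup that `Charset.forName` does, but both refer to the same charset.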

Thanks a lot anyway
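For anyone wondering where the odd characters came from: they look like UTF-8 bytes decoded with a single-byte charset. The sketch below reproduces the effect under the assumption that the platform default was ISO-8859-1 or something close to it (windows-1252 gives the same result for these letters). I don't know what the default actually was on the machine in question. `MojibakeDemo` and `misdecode` are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class MojibakeDemo {

    // Encodes `text` as UTF-8, then decodes those bytes with the `wrong` charset.
    static String misdecode(String text, Charset wrong) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrong);
    }

    public static void main(String[] args) {
        // "situação", written with escapes so the source file encoding does not matter.
        String original = "situa\u00e7\u00e3o";
        // Assumption: the default charset behaved like ISO-8859-1.
        String garbled = misdecode(original, StandardCharsets.ISO_8859_1);
        // Each accented letter becomes two characters, as in the output above.
        System.out.println(garbled.toLowerCase(Locale.ROOT));
    }
}
```

Each accented letter takes two bytes in UTF-8, and a single-byte charset turns every byte into its own character, so `ç` comes out as `Ã§`. Lowercasing then gives `ã§`, and the program in the question puts spaces around the `§` because it treats it as a symbol.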

answered 2014-10-23T11:04:32.023