java - StringEscapeUtils.unescapeHtml does not work on strings read from a file

Question

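I am trying to read a file that contains unicode characters, convert those characters to their corresponding symbols, and then print the resulting text to a new file. I am trying to use StringEscapeUtils.unescapeHtml to do this, but the lines are just printed as is, with the unicode points still intact. As an experiment, I copied a line from the file, created a String from it, and called StringEscapeUtils.unescapeHtml on that, which worked fine. My code is as follows: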

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileReader;
    import java.io.FileWriter;
    // Assumed: StringEscapeUtils from Apache Commons Lang 3
    import org.apache.commons.lang3.StringEscapeUtils;

    class FileWrite {
        public static void main(String[] args) {
            try {
                String testString = " \"text\":\"Dude With Knit Hat At Party Calls Beer \u2018Libations\u2019 http://t.co/rop8NSnRFu\" ";

                FileReader instream = new FileReader("Home Timeline.txt");
                BufferedReader b = new BufferedReader(instream);

                FileWriter fstream = new FileWriter("out.txt");
                BufferedWriter out = new BufferedWriter(fstream);

                //This gives the desired output,
                //with unicode points converted
                out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");

                // readLine() already returns a String (or null at end of file)
                String line = b.readLine();

                while (line != null) {
                    out.write(StringEscapeUtils.unescapeHtml3(line) + "\n");
                    line = b.readLine();
                }

                // Close the input and output streams
                b.close();
                out.close();
            } catch (Exception e) { // Catch exception if any
                System.err.println("Error: " + e.getMessage());
            }
        }
    }

score 2 · Accepted Answer

//This gives the desired output,
//with unicode points converted
out.write(StringEscapeUtils.unescapeHtml3(testString) + "\n");

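You have misunderstood what is happening. Java itself unescapes string literals of this form at compile time, when the source is compiled into the class file: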

"\u2018Libations\u2019"

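There is no HTML 3 escaping in this code. By the time unescapeHtml3 is called, testString already contains the real quotation mark characters, so the call changes nothing. The method you have chosen is designed to unescape HTML escape sequences of the form &lsquo;. The lines read from the file presumably still contain the literal characters \u2018, which are not HTML entities, so they pass through untouched.

You probably want the unescapeJava method.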
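
To make the distinction concrete, here is a minimal sketch that uses only the standard library to decode literal \uXXXX sequences in text read at runtime. The class and method names (UnicodeEscapeSketch, decodeUnicodeEscapes) are hypothetical and exist only for this illustration. It is not how unescapeJava is implemented, and it handles only \uXXXX escapes: it ignores other Java escapes such as \n and does not treat a doubled backslash specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Commons Lang.
class UnicodeEscapeSketch {
    // A literal backslash, one or more 'u', then exactly four hex digits.
    private static final Pattern UNICODE_ESCAPE =
            Pattern.compile("\\\\u+([0-9a-fA-F]{4})");

    static String decodeUnicodeEscapes(String input) {
        Matcher m = UNICODE_ESCAPE.matcher(input);
        StringBuilder sb = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy the text before the escape, then the decoded character.
            sb.append(input, last, m.start());
            sb.append((char) Integer.parseInt(m.group(1), 16));
            last = m.end();
        }
        // Copy whatever follows the final escape.
        sb.append(input, last, input.length());
        return sb.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes put a literal backslash in the String,
        // which is what a line read from the file would contain.
        String fromFile = "Beer \\u2018Libations\\u2019";
        System.out.println(decodeUnicodeEscapes(fromFile));
    }
}
```

In the question's loop, the equivalent change would be to call StringEscapeUtils.unescapeJava(line) instead of unescapeHtml3(line).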

score 1 · Accepted Answer

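You are reading and writing the strings using your platform's default encoding. You want to explicitly specify "UTF-8" as the character set to use:

Input stream: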

BufferedReader b = new BufferedReader(new InputStreamReader(
        new FileInputStream("Home Timeline.txt"),
        Charset.forName("UTF-8")));

Output stream:

BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
        new FileOutputStream("out.txt"),
        Charset.forName("UTF-8")));
