java - Reading EUC-encoded HTML with Java on Windows

Question

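I am trying to read an HTML file encoded in EUC-KR from a URL. When I run the code inside the IDE I get the desired output, but when I build a jar and run that jar, the data I read shows up as question marks ("????" instead of the Korean characters). I assume this is because the encoding is being lost.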

The site's meta tag is as follows:

 <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

Here is my code:

  String line;
  URL u = new URL("link to the site");
  InputStream in = u.openConnection().getInputStream();
  BufferedReader r = new BufferedReader(new InputStreamReader(in, "EUC-KR"));
  while ((line = r.readLine()) != null) {
    // Send the string to a text area: this works fine now.
    // Take the string and pass it through a ByteArrayInputStream:
    // this is where I believe the encoding is lost.

    InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
    Reader reader = new InputStreamReader(xin);
    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    kit.read(reader, doc, 0);
    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);

    while (it.isValid()) {
      chaps.add(doc.getText(it.getStartOffset(), it.getEndOffset() - it.getStartOffset()).trim());
      // chaps is an ArrayList<String>
      it.next();
    }
  }

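I would appreciate it if someone could help me figure out how to get the characters without losing the encoding when the application runs on any platform, independent of the system default encoding.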

Thanks

PS: The system encoding is reported as Cp1252 when the program runs as a jar, and as UTF-8 when it runs in the IDE.

score 3 · Accepted Answer

InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
Reader reader = new InputStreamReader(xin);

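This is a transcoding error. You encode the string as "EUC-KR" and then decode it with the system default encoding (which produces garbage). To avoid this, you must pass the encoding to the InputStreamReader.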
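As a rough illustration of that mismatch, the sketch below encodes a Korean string as EUC-KR and decodes the bytes twice: once with the same charset, and once with windows-1252 standing in for a Windows system default. The class name `TranscodingDemo` and the helper `roundTrip` are made up for this example, and it assumes the JRE ships the EUC-KR charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class TranscodingDemo {

    // Hypothetical helper: encodes text with one charset, then decodes
    // the resulting bytes with another, mirroring the code in the question.
    public static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        StringBuilder out = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decodeWith)) {
            int c;
            while ((c = reader.read()) != -1) {
                out.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Charset eucKr = Charset.forName("EUC-KR");
        Charset cp1252 = Charset.forName("windows-1252");
        String korean = "\uD55C\uAD6D\uC5B4"; // three Hangul syllables

        // Same charset on both sides: the text survives.
        System.out.println(korean.equals(roundTrip(korean, eucKr, eucKr)));   // true
        // Decoding EUC-KR bytes as windows-1252: the text is corrupted.
        System.out.println(korean.equals(roundTrip(korean, eucKr, cp1252)));  // false
    }
}
```

The first call corresponds to passing the charset explicitly, as in `new InputStreamReader(xin, "EUC-KR")`; the second corresponds to leaving it out on a machine whose default is Cp1252.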

However, it would be better to avoid all the encoding and decoding and just use a StringReader.
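A minimal sketch of the StringReader approach might look like the following. The class `StrongTextExtractor` and the method `extractStrong` are hypothetical names, not part of the original code. Setting the `IgnoreCharsetDirective` document property is an addition the answer does not mention; it is assumed to be needed here because the page carries a charset meta tag, which the Swing parser may otherwise reject with a ChangedCharSetException.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

class StrongTextExtractor {

    // Hypothetical helper: parses already-decoded HTML text and returns
    // the trimmed contents of every <strong> run.
    public static List<String> extractStrong(String html) {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // Assumption: the markup may contain a charset meta tag. The text is
        // already a Java String, so the directive can be ignored.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        List<String> chaps = new ArrayList<>();
        try {
            // No bytes are involved, so no charset can be mismatched.
            kit.read(new StringReader(html), doc, 0);
            HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);
            while (it.isValid()) {
                int start = it.getStartOffset();
                chaps.add(doc.getText(start, it.getEndOffset() - start).trim());
                it.next();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
        return chaps;
    }

    public static void main(String[] args) {
        String html = "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=euc-kr\">"
                + "</head><body><p>Intro <strong>\uD55C\uAD6D\uC5B4</strong> end</p></body></html>";
        System.out.println(extractStrong(html));
    }
}
```

In the question's loop, the string read through the EUC-KR `InputStreamReader` would be passed straight to a method like this, with no `getBytes` call in between.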
