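I am trying to read an HTML file encoded in EUC-KR from a URL. When I run the code from the IDE I get the expected output, but when I build a jar and run that jar, the data I read shows up as question marks ("????" instead of the Korean characters). I assume the encoding is being lost somewhere. The page declares its charset in this meta tag: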
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
String line;
URL u = new URL("link to the site");
InputStream in = u.openConnection().getInputStream();
BufferedReader r = new BufferedReader(new InputStreamReader(in, "EUC-KR"));
while ((line = r.readLine()) != null) {
    // Send the string to a text area: this works fine now.
    // Take the string and pass it through a ByteArrayInputStream:
    // this is where I believe the encoding is lost.
    InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
    Reader reader = new InputStreamReader(xin);
    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    kit.read(reader, doc, 0);
    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);
    while (it.isValid()) {
        // chaps is an ArrayList<String>
        chaps.add(doc.getText(it.getStartOffset(), it.getEndOffset() - it.getStartOffset()).trim());
        it.next(); // advance, otherwise the loop never ends
    }
}
PS: The system encoding is reported as Cp1252 when the program runs from the jar, and as UTF-8 when it runs in the IDE.
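For reference, here is a minimal sketch of the round trip that seems to be at issue. In the snippet above the string is encoded to bytes as EUC-KR, but `new InputStreamReader(xin)` is given no charset, so it decodes with the platform default, which would explain why the IDE and the jar behave differently. The class name `EncodingRoundTrip`, the helpers `readAll` and `decode`, and the sample HTML are hypothetical and only for illustration. It also assumes the JVM provides the EUC-KR charset, which the code above already relies on.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.Charset;

public class EncodingRoundTrip {

    // Hypothetical helper: drains a Reader into a String.
    public static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Hypothetical helper: decodes bytes with an explicitly named charset
    // instead of relying on the platform default.
    public static String decode(byte[] bytes, Charset cs) throws IOException {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            return readAll(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        Charset eucKr = Charset.forName("EUC-KR");
        // Three Hangul syllables, written as Unicode escapes so the example
        // does not depend on the encoding of this source file.
        String html = "<strong>\uD55C\uAD6D\uC5B4</strong>";
        byte[] bytes = html.getBytes(eucKr);

        // Same charset on both sides: the text survives.
        System.out.println(html.equals(decode(bytes, eucKr)));

        // Decoding the same bytes as windows-1252 (Cp1252): the text is garbled.
        System.out.println(html.equals(decode(bytes, Charset.forName("windows-1252"))));

        // Alternative: skip the byte round trip and read the String directly.
        System.out.println(html.equals(readAll(new StringReader(html))));
    }
}
```

If this is indeed the cause, either `new InputStreamReader(xin, "EUC-KR")` or `new StringReader(thestring)` could be handed to `kit.read(...)` in place of the default-charset reader. The second option avoids encoding and decoding the text at all.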