java - Parsing a UTF-8 encoded XML file

Question

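I have an XML file, retrieved from a URL, that contains some Arabic characters, so I have to encode it as UTF-8 for those characters to be handled correctly.

The XML file: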

<Entry>

    <lstItems>            
           <item>
        <id>1</id>
            <title>News Test 1</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/news1.jpg</img>
           </item>
           <item>
        <id>2</id>
            <title>كريم</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/news2.jpg</img>
           </item>
           <item>
        <id>3</id>
            <title>News Test 333</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/news3.jpg</img>
           </item> 
           <item>
        <id>4</id>
            <title>ربيع</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/cont20.jpg</img>
           </item> 
           <item>
        <id>5</id>
            <title>News Test 55555</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/cont21.jpg</img>
           </item>      
           <item>
        <id>6</id>
            <title>News Test 666666</title>
            <subtitle>16/7/2012</subtitle>
        <img>joelle.mobi-mind.com/imgs/cont22.jpg</img>
           </item>               
    </lstItems>
  </Entry>

I read the XML retrieved from the URL into a String as follows:

public String getXmlFromUrl(String url) {

    try {
        return new AsyncTask<String, Void, String>() {
            @Override
            protected String doInBackground(String... params) {
                //String xml = null;
                try {
                    DefaultHttpClient httpClient = new DefaultHttpClient();
                    HttpGet httpPost = new HttpGet(params[0]);
                    HttpResponse httpResponse = httpClient.execute(httpPost);
                    HttpEntity httpEntity = httpResponse.getEntity();
                    xml = new String(EntityUtils.toString(httpEntity).getBytes(),"UTF-8");


                } catch (Exception e) {
                    e.printStackTrace();
                }
                return xml;
            }
        }.execute(url).get();
    } catch (InterruptedException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (ExecutionException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return xml;
}

The returned String is then passed to this method to get a Document for later use:

public Document getDomElement(String xml){

        Document doc = null;
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();

        try {

            DocumentBuilder db = dbf.newDocumentBuilder();
            InputSource is = new InputSource();
            StringReader xmlstring=new StringReader(xml);
            is.setCharacterStream(xmlstring);
            is.setEncoding("UTF-8");
                    //Code Stops here !
            doc = db.parse(is); 


        } catch (ParserConfigurationException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (SAXException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        } catch (IOException e) {
            Log.e("Error: ", e.getMessage());
            return null;
        }
        // return DOM
        return doc;

}

This fails with the following error message:

09-18 07:51:40.441: E/Error:(1210): Unexpected token (position:TEXT ï»¿@1:4 in java.io.StringReader@4144c240)

So the code crashes at the point marked above, with the following error:

09-18 07:51:40.451: E/AndroidRuntime(1210): java.lang.RuntimeException: Unable to start activity ComponentInfo{com.example.university1/com.example.university1.MainActivity}: java.lang.NullPointerException

Note that the code works with ISO encoding.

score 2 · Accepted Answer

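You have added a BOM (byte order mark) to your UTF-8 file, which is bad.

Maybe you edited the file with Notepad; otherwise, check your editor to make sure it does not add a BOM.

Since the BOM seems to be inside the text rather than at the start, you also need to remove it by using the Delete key around its position (it is invisible in most editors). This may have happened during a file concatenation operation.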
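If the file on the server cannot be fixed, the BOM can also be dropped in code before the string reaches the parser. The following is only a minimal sketch: the class and method names (BomStripper, stripLeadingBom, stripAllBoms) are hypothetical, and it assumes the response was decoded correctly as UTF-8, so that the BOM shows up as the single character U+FEFF rather than as the three characters ï»¿.

```java
public class BomStripper {
    // The byte order mark as a single char, as it appears after correct UTF-8 decoding.
    private static final char BOM = '\uFEFF';

    /** Returns the text without a leading BOM, or unchanged if there is none. */
    public static String stripLeadingBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    /** Removes every BOM, including stray ones left inside the text by file concatenation. */
    public static String stripAllBoms(String text) {
        return text == null ? null : text.replace(String.valueOf(BOM), "");
    }
}
```

With a helper like this, getDomElement could be called as getDomElement(BomStripper.stripLeadingBom(xml)); whether the leading or the "all" variant is appropriate depends on where the BOM really sits in the data.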

score 1 · Accepted Answer

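This may not be the problem, but EntityUtils.toString(httpEntity).getBytes() uses the default platform encoding. You should use the result of EntityUtils.toString(httpEntity) directly as a String; there is no need to convert it to bytes.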
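To illustrate why the charset used for decoding matters, here is a small standalone sketch using only the JDK (the class name DecodeDemo and the sample bytes are made up for the example). The three BOM bytes EF BB BF turn into the characters ï»¿ seen in the error message when they are decoded as ISO-8859-1, and into the single character U+FEFF when decoded as UTF-8. In the question's code, the likely equivalent is passing the charset to EntityUtils, as in EntityUtils.toString(httpEntity, "UTF-8"); note that this argument is only a default that applies when the response does not declare a charset itself.

```java
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    /** Decodes raw response bytes as UTF-8, independent of the platform default charset. */
    public static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 BOM followed by "<a/>"
        byte[] raw = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 'a', '/', '>' };

        // Wrong charset: each BOM byte becomes its own character.
        String wrong = new String(raw, StandardCharsets.ISO_8859_1);
        // Right charset: the three BOM bytes become the single char U+FEFF.
        String right = decodeUtf8(raw);

        System.out.println(wrong.length()); // prints 7
        System.out.println(right.length()); // prints 5
    }
}
```

Even with the correct charset, Java keeps the BOM as a character in the String, so it still has to be removed before the text is handed to a parser through a StringReader.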

Also, read http://kunststube.net/encoding/ for useful background on what is happening.
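An alternative that avoids the String round trip altogether is to give the parser the raw bytes and let it detect the encoding itself. This is a sketch under assumptions, not the original poster's code: StreamParser is a hypothetical class, and the byte array stands in for whatever the HTTP response body contains. The XML parser bundled with the desktop JDK recognizes a UTF-8 BOM at the start of a byte stream and skips it; the parser shipped with Android should be checked separately.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class StreamParser {
    /**
     * Parses XML from raw bytes. Because the parser sees bytes rather than
     * already-decoded characters, it can apply its own encoding detection
     * (BOM, then the XML declaration) instead of relying on the caller.
     */
    public static Document parse(byte[] rawXml) {
        try {
            DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return db.parse(new ByteArrayInputStream(rawXml));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Fail loudly rather than returning null, which would only move the crash elsewhere.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Throwing here instead of returning null is a deliberate choice for the sketch: in the question, the null returned from getDomElement is what later surfaces as the NullPointerException in MainActivity.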
