java - Parsing an XML file that contains some unknown characters at the beginning

Question

I am trying to parse an XML file that has some unknown characters at the beginning, and I get this error:

java.lang.IllegalStateException: No successful match so far

This is the XML file:

<?xml version="1.0" encoding="utf-8"?>
<!--RSS generated by RSSviaXmlTextWriter at Thu, 02 Mar 2017 16:35:42 GMT-->
<rss version="2.0">
  <channel>
    <title>The Tribune</title>
    <link>http://www.tribuneindia.com/</link>
    <description>Tribune News Service</description>
    <item>
      <category>Jammu &amp; Kashmir</category>
      <link>http://www.tribuneindia.com/news/jammu-kashmir/valley-schools-reopen-after-8-months/371227.html</link>
      <title>Valley schools reopen after 8 months</title>
      <image>http://images.tribuneindia.com/cms/gall_content/2017/3/2017_3$largeimg01_Wednesday_2017_232657586.jpg</image>
      <description>SRINAGAR: The schools in the Valley reopened fully today after remaining closed for eight months bringing back the liveliness that had been missing in the winter months. The schools were shut after the eruption of unrest following the killing of Hizbul commander Burhan Wani on July 8 last year.</description>
      <pubDate>Thu, 02 Mar 2017 00:57:23 GMT</pubDate>
      <guid>http://www.tribuneindia.com/news/jammu-kashmir/valley-schools-reopen-after-8-months/371227.html</guid>
    </item>
    <item>
      <category>Jammu &amp; Kashmir</category>
      <link>http://www.tribuneindia.com/news/jammu-kashmir/registration-of-pilgrims-for-amarnath-yatra-begins/371228.html</link>
      <title>Registration of pilgrims for Amarnath yatra begins</title>
      <image>http://images.tribuneindia.com/cms/gall_content/2017/3/2017_3$largeimg01_Wednesday_2017_232805408.jpg</image>
      <description>JAMMU: Amid the chants of “Bham Bham Bhole”, the registration of pilgrims for this year’s pilgrimage to the Amarnath cave shrine commenced for both Baltal and Chandanwari routes here today.</description>
      <pubDate>Thu, 02 Mar 2017 00:57:23 GMT</pubDate>
      <guid>http://www.tribuneindia.com/news/jammu-kashmir/registration-of-pilgrims-for-amarnath-yatra-begins/371228.html</guid>
    </item>
    <item>
      <category>Jammu &amp; Kashmir</category>
      <link>http://www.tribuneindia.com/news/jammu-kashmir/ladakh-worried-over-costly-air-travel-ahead-of-tourist-season/371235.html</link>
      <title>Ladakh worried over costly air travel ahead of tourist season</title>
      <image>http://images.tribuneindia.com/cms/gall_content/2017/3/2017_3$largeimg01_Wednesday_2017_233238395.jpg</image>
      <description>JAMMU: With Ladakh bracing up to host domestic and foreign tourists, the exorbitant air travel to the arid region continues to be a cause for worry for all stakeholders as the Civil Aviation Ministry is yet to make a formal commitment on “rationalisation of airfares” for Ladakh during peak tourist season from May to September.</description>
      <pubDate>Thu, 02 Mar 2017 00:57:23 GMT</pubDate>
      <guid>http://www.tribuneindia.com/news/jammu-kashmir/ladakh-worried-over-costly-air-travel-ahead-of-tourist-season/371235.html</guid>
    </item>
    <item>
      <category>Jammu &amp; Kashmir</category>
      <link>http://www.tribuneindia.com/news/jammu-kashmir/take-steps-for-benefits-of-pilgrims-governor-tells-shrine-board-ceo/371220.html</link>
      <title>Take steps for benefits of pilgrims, Governor tells Shrine Board CEO</title>
      <image>http://images.tribuneindia.com/cms/gall_content/2017/3/2017_3$largeimg01_Wednesday_2017_232356489.jpg</image>
      <description>JAMMU: Governor NN Vohra today said several important issues related to Katra and its surrounding areas were conclusively addressed in the meeting held at Raj Bhawan on February 17 in which Chief Minister Mehbooba Mufti was also present.</description>
      <pubDate>Thu, 02 Mar 2017 00:57:23 GMT</pubDate>
      <guid>http://www.tribuneindia.com/news/jammu-kashmir/take-steps-for-benefits-of-pilgrims-governor-tells-shrine-board-ceo/371220.html</guid>
    </item>
    <item>
      <category>Jammu &amp; Kashmir</category>
      <link>http://www.tribuneindia.com/news/jammu-kashmir/police-find-weapons-in-cross-loc-truck/371225.html</link>
      <title>Police find weapons in cross-LoC truck</title>
      <image>http://images.tribuneindia.com/cms/gall_content/archive/</image>
      <description>SRINAGAR: A cache of weapons, which was being smuggled for militants in the Kashmir valley, was recovered from a truck engaged in cross-LoC trade in north Kashmir’s Baramulla district, the police said today.</description>
      <pubDate>Thu, 02 Mar 2017 00:57:23 GMT</pubDate>
      <guid>http://www.tribuneindia.com/news/jammu-kashmir/police-find-weapons-in-cross-loc-truck/371225.html</guid>
    </item>
    <item>
      <category>Jammu &amp; Kashmir</category>
      <link>http://www.tribuneindia.com/news/jammu-kashmir/shutdown-in-bannihal-town-over-twin-deaths/371579.html</link>
      <title>Shutdown in Bannihal town over twin deaths</title>
      <image>http://images.tribuneindia.com/cms/gall_content/archive/</image>
      <description>JAMMU: A shutdown marred life in a Jammu and Kashmir town on Thursday amid allegations that the driver and a cleaner in a truck found dead in an accident had actually been murdered.</description>
      <pubDate>Thu, 02 Mar 2017 13:28:31 GMT</pubDate>
      <guid>http://www.tribuneindia.com/news/jammu-kashmir/shutdown-in-bannihal-town-over-twin-deaths/371579.html</guid>
    </item>
  </channel>
</rss>

I think the problem is caused by line 2:

<!--RSS generated by RSSviaXmlTextWriter at Thu, 02 Mar 2017 16:35:42 GMT-->

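Apparently XmlPullParser treats the first line as a comment, but the problem is with the second line. My guess is that the parser cannot handle the second line because it is looking for a start tag and runs into plain characters instead.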

Here is my parser:

public class SitesXmlPullParserTribuneLocal{

    static final String KEY_SITE = "item";
    static final String KEY_NAME = "title";
    static final String KEY_LINK = "link";
    static final String KEY_ABOUT = "description";
    static final String KEY_IMAGE_URL = "image";
    static final String KEY_DATE = "pubDate";
    private static boolean firstCheck = true;


    public static List<NewsItems> getStackSitesFromFile(Context ctx) {

        // List of StackSites that we will return
        List<NewsItems> newsItems;
        newsItems = new ArrayList<NewsItems>();

        // temp holder for current StackSite while parsing
        NewsItems curNewsItems = null;


        // Temporary Holder for current text value while parsing
        String curText = "";

        try {
            // Get our factory and PullParser
            XmlPullParserFactory factory = XmlPullParserFactory.newInstance();
            XmlPullParser xpp = factory.newPullParser();

            // Open up InputStream and Reader of our file.
            FileInputStream fis = ctx.openFileInput("TribuneLocal.xml");
//            Log.e("ERROR at TribuneLocal", String.valueOf(ctx.openFileInput("TribuneLocal.xml")));
            BufferedReader reader = new BufferedReader(new InputStreamReader(fis));

            // point the parser to our file.
            xpp.setInput(reader);

            // get initial eventType
            int eventType = xpp.getEventType();
            Log.e("TagName Local", String.valueOf(eventType));


            //To get the actual location to start parsing from
            boolean actual_work = false;

            // Loop through pull events until we reach END_DOCUMENT
            Log.e("You Reached To", "Mark1");

            while (eventType != XmlPullParser.END_DOCUMENT) {
                // Get the current tag
                Log.e("You Reached To", "Mark2");
                String tagName = xpp.getName();
                Log.e("You Reached To", "Mark3");
//                Log.e("TagName is", tagName);
                // React to different event types appropriately
                if (eventType != XmlPullParser.START_TAG ) {
                    Log.e("EventType Inside", String.valueOf(eventType));
//                    firstCheck = false;
                    eventType = xpp.next();
                    continue;
                }
                xpp.setFeature("http://xmlpull.org/v1/doc/features.html#relaxed", true);

                Log.e("EventType Outside", String.valueOf(eventType));
                switch (eventType) {
                    case XmlPullParser.START_TAG:
                        Log.e("You Reached To", "Mark4");
                        if (tagName.equalsIgnoreCase(KEY_SITE)) {
                            // If we are starting a new <item> block we need
                            //a new NewsItems object to represent it
                                actual_work = true;
                                curNewsItems = new NewsItems();
                        }

                        break;

                    case XmlPullParser.TEXT:
                        //grab the current text so we can use it in END_TAG event
                        curText = xpp.getText();
                        break;

                    case XmlPullParser.END_TAG:
                        if (tagName.equalsIgnoreCase(KEY_SITE) && actual_work) {
                            // if </item> then we are done with current Site
                            // add it to the list.
                            newsItems.add(curNewsItems);

                        } else if (tagName.equalsIgnoreCase(KEY_NAME) && actual_work) {
                            // if </title> use setTitle() on curSite
                            Log.e("TITLE IS ",curText);

                            curNewsItems.setTitle(curText);

                        } else if (tagName.equalsIgnoreCase(KEY_LINK) && actual_work) {
                            // if </link> use setLink() on curSite
                            Log.e("LINK IS ",curText);

                            curNewsItems.setLink(curText);
                        } else if (tagName.equalsIgnoreCase(KEY_ABOUT) && actual_work) {
                            // if </description> use setDescription() on curSite
                            Log.e("DESCRIPTION IS ",curText);

                            curNewsItems.setDescription(curText);
                        } else if (tagName.equalsIgnoreCase(KEY_DATE) && actual_work) {
                            // if </pubDate> use setDate() on curSite
                            Log.e("DATE IS  : ",curText);

                            curNewsItems.setDate(curText);
                        }else if (tagName.equalsIgnoreCase(KEY_IMAGE_URL) && actual_work) {
                            // if </image> use setImgUrl() on curSite
                            Log.e("IMAGE IS  : ",curText);

                            curNewsItems.setImgUrl(curText);
                        }
                        break;

                    default:
                        break;
                }
                //move on to next iteration
                eventType = xpp.next();
            }
        } catch (Exception e) {
            Log.e("Tribune Local File","There is an ERROR Parsing It");
        }

        // return the populated list.
        return newsItems;
    }
}

score 0 · Accepted Answer

The XML file is fine.
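One way to convince yourself that the comment on line 2 is harmless is to run a trimmed copy of the feed through a pull-style parser. The sketch below is only an illustration: it uses StAX (`javax.xml.stream`, which ships with the JDK) as a stand-in for Android's XmlPullParser, and the class name `PrologCommentDemo`, the `SAMPLE` string and the `itemTitles` helper are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; StAX is used here instead of XmlPullParser.
class PrologCommentDemo {

    // Trimmed-down copy of the feed from the question (one item only).
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
      + "<!--RSS generated by RSSviaXmlTextWriter at Thu, 02 Mar 2017 16:35:42 GMT-->\n"
      + "<rss version=\"2.0\"><channel>"
      + "<title>The Tribune</title>"
      + "<item><title>Valley schools reopen after 8 months</title></item>"
      + "</channel></rss>";

    // Returns the text of every <title> that sits inside an <item>.
    static List<String> itemTitles(String xml) throws XMLStreamException {
        XMLStreamReader reader =
            XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> titles = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        boolean inItem = false;
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.COMMENT:
                    // The comment before <rss> is just another event; nothing to do.
                    break;
                case XMLStreamConstants.START_ELEMENT:
                    if (reader.getLocalName().equals("item")) {
                        inItem = true;
                    }
                    text.setLength(0); // start collecting text for this element
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    String name = reader.getLocalName();
                    if (name.equals("item")) {
                        inItem = false;
                    } else if (inItem && name.equals("title")) {
                        titles.add(text.toString());
                    }
                    break;
                default:
                    break;
            }
        }
        reader.close();
        return titles;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(itemTitles(SAMPLE));
    }
}
```

Note that this loop handles text and end-element events as well as start elements. In the question's loop, every event other than START_TAG is skipped by the `continue` before the `switch`, so its TEXT and END_TAG cases are never reached; that is worth checking independently of the comment line.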

The following parses the file; you can then use the DOM to extract the elements, ...

        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        builderFactory.setNamespaceAware(true);
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        // Parse the file; fis is the FileInputStream opened in the question's code
        Document document = builder.parse(new InputSource(new InputStreamReader(fis)));
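As a rough sketch of the "use the DOM and extract the elements" step, something along these lines should work. It is self-contained so it can run outside Android: the feed is a shortened inline string rather than `TribuneLocal.xml`, and the names `DomRssDemo`, `SAMPLE`, `childText` and `summarize` are invented for this example rather than taken from the question.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class built around the DOM calls from the answer.
class DomRssDemo {

    // Two items copied from the question's feed, shortened to title and link.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n"
      + "<!--RSS generated by RSSviaXmlTextWriter at Thu, 02 Mar 2017 16:35:42 GMT-->\n"
      + "<rss version=\"2.0\"><channel><title>The Tribune</title>"
      + "<item><category>Jammu &amp; Kashmir</category>"
      + "<link>http://www.tribuneindia.com/news/jammu-kashmir/valley-schools-reopen-after-8-months/371227.html</link>"
      + "<title>Valley schools reopen after 8 months</title></item>"
      + "<item><category>Jammu &amp; Kashmir</category>"
      + "<link>http://www.tribuneindia.com/news/jammu-kashmir/registration-of-pilgrims-for-amarnath-yatra-begins/371228.html</link>"
      + "<title>Registration of pilgrims for Amarnath yatra begins</title></item>"
      + "</channel></rss>";

    // Text of the first descendant element named `tag`, or "" if there is none.
    static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    // Parses the feed with DOM and returns one "title | link" string per <item>.
    static List<String> summarize(String xml) throws Exception {
        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        builderFactory.setNamespaceAware(true);
        DocumentBuilder builder = builderFactory.newDocumentBuilder();
        Document document = builder.parse(new InputSource(new StringReader(xml)));

        List<String> rows = new ArrayList<>();
        NodeList items = document.getElementsByTagName("item");
        for (int i = 0; i < items.getLength(); i++) {
            Element item = (Element) items.item(i);
            rows.add(childText(item, "title") + " | " + childText(item, "link"));
        }
        return rows;
    }

    public static void main(String[] args) throws Exception {
        for (String row : summarize(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

In the Android code you would pass the stream opened with `ctx.openFileInput("TribuneLocal.xml")` to `builder.parse(...)` instead of the inline string, and fill a `NewsItems` object per `<item>` instead of building a string.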
