java - Personal Project "RSS FEED" XML Parser

Question

I am relatively new to Java, and for a couple of long, LONG days now I have been trying to figure out how to reach the following tags for output. I would really appreciate some insight into the problem. Everything I could find or try just does not pan out. (Excuse the cheesy news articles.)

<item>
<pubDate>Sat, 21 Sep 2013 02:30:23 EDT</pubDate>
<title>
<![CDATA[
Carmen Bryan Lashes Out at Beyonce Fans for Throwing Shade (@carmenbryan)
]]>
</title>
<link>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</link>
<guid>
http://www.vladtv.com/blog/174937/carmen-bryan-lashes-out-at-beyonce-fans-for-throwing-shade/
</guid>
<description>
<![CDATA[
<img ... /><br />.
 <p>In response to someone who reminded Bryan that Jay Z has Beyonce now, she tweeted.</p>
 <p>Check out what else Bryan had to say above.</p>
 <p>Source: </p>
]]>
</description>
</item>

I have managed to parse the XML and print the content of both the title and description elements; however, the output for the description element also includes all of its child element tags. I would like to use this project in the future to build my Java portfolio, so please help!

My code so far:

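import java.io.IOException;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class NewXmlReader {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            // parse(String) treats its argument as a URI and reads the document from it
            Document docXml = builder.parse(NewXMLReaderHandlers.inputHandler());
            docXml.getDocumentElement().normalize();

            NewXMLReaderHandlers.handleItemTags(docXml, "item");

        } catch (ParserConfigurationException | SAXException parseException) {
            System.out.println("You Are Not XML formatted !!");
            parseException.printStackTrace();
        } catch (IOException ioException) {
            System.out.println("URL NOT FOUND");
            // getCause() alone discards its result; print the trace instead
            ioException.printStackTrace();
        }
    }

}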

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class NewXMLReaderHandlers {

    private static int ARTICLELENGTH;

    // Prompts on standard input and returns the URL the user typed
    public static String inputHandler() throws IOException {
        InputStreamReader inputStream = new InputStreamReader(System.in);
        BufferedReader bufferRead = new BufferedReader(inputStream);
        System.out.println("Please Enter A Proper URL: ");
        String urlPageString = bufferRead.readLine();
        return urlPageString;
    }

    public static void handleItemTags(Document document, String rssFeedParentTopicTag) {
        NodeList listOfArticles = document.getElementsByTagName(rssFeedParentTopicTag);
        NewXMLReaderHandlers.ARTICLELENGTH = listOfArticles.getLength();
        String rootElement = document.getDocumentElement().getNodeName();
        // Compare strings with equals(); == only checks reference identity
        if ("rss".equals(rootElement)) {
            System.out.println("We Have An RSS Feed To Parse");

            for (int i = 0; i < NewXMLReaderHandlers.ARTICLELENGTH; i++) {
                Node itemNode = listOfArticles.item(i);
                if (itemNode.getNodeType() == Node.ELEMENT_NODE) {
                    Element itemElement = (Element) itemNode;
                    tagContent(itemElement, "title");
                    tagContent(itemElement, "description");
                }
            }
        }

    }

    public static void tagContent(Element item, String tagName) {
        NodeList tagNodeList = item.getElementsByTagName(tagName);
        Element tagElement = (Element) tagNodeList.item(0);
        NodeList tagTElist = tagElement.getChildNodes();
        // First child only; this may be a whitespace text node that precedes the CDATA section
        Node tagNode = tagTElist.item(0);

//      System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n");
        if ("description".equals(tagName)) {
            System.out.println( " - " + tagName + " : " + tagNode.getNodeValue() + "\n\n");
            System.out.println(" Do We Have Any Siblings? " + tagNode.getNextSibling().getNodeValue() + "\n");
        }
    }
}

score 2 · Accepted Answer

For my money, the easiest solution would be to use the XPath API.

Essentially, it's a query language for XML. See XPath Tutorial for a primer.

This example uses the RSS feed from SO, which uses <entry...> instead of <item>, but I've used the same technique for other RSS (and XML) files and even very complex HTML documents...

import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class TestRSSFeed {

    public static void main(String[] args) {
        try {
            // Read the feed...
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            Document doc = factory.newDocumentBuilder().parse("http://stackoverflow.com/feeds/tag?tagnames=java&sort=newest");
            Element root = doc.getDocumentElement();

            // Create a xPath instance
            XPath xPath = XPathFactory.newInstance().newXPath();
            // Find all the nodes that are named <entry...> anywhere in the
            // document (a leading // always searches from the document root)...
            XPathExpression expression = xPath.compile("//entry");
            NodeList nl = (NodeList) expression.evaluate(root, XPathConstants.NODESET);

            System.out.println("Found " + nl.getLength() + " items...");
            for (int index = 0; index < nl.getLength(); index++) {
                Node node = nl.item(index);
                // This is a sub node search.
                // The search is based on the parent node and looks for a single
                // node titled "title" that belongs to the parent node...
                // I did this because I'm only expecting a single node...
                expression = xPath.compile("title");
                Node child = (Node) expression.evaluate(node, XPathConstants.NODE);
                System.out.println(child.getTextContent());
            }

        } catch (IOException | ParserConfigurationException | SAXException exp) {
            exp.printStackTrace();
        } catch (XPathExpressionException ex) {
            ex.printStackTrace();
        }
    }

}

Now, you can do some pretty complex queries, but I thought I'd start with a basic example ;)
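
To tie this back to the <item> structure in the question, here is a minimal sketch of a slightly more specific query. It assumes the usual rss/channel/item layout; the class name ItemQuerySketch, the helper names and the trimmed-down sample feed are made up for illustration. It uses the XPath function normalize-space() to get the text of a child element with the surrounding whitespace removed.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class ItemQuerySketch {

    // Hypothetical sample feed, cut down from the <item> shown in the question.
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title><![CDATA[ Some headline ]]></title>"
        + "<description><![CDATA[<p>First paragraph.</p>]]></description>"
        + "</item></channel></rss>";

    // Parses an XML string into a DOM document.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Returns the whitespace-trimmed text of one child of the n-th item (1-based).
    static String childText(Document doc, int itemIndex, String childName) throws Exception {
        XPath xPath = XPathFactory.newInstance().newXPath();
        String query = "normalize-space(/rss/channel/item[" + itemIndex + "]/" + childName + ")";
        return xPath.evaluate(query, doc);
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        System.out.println(childText(doc, 1, "title"));
        System.out.println(childText(doc, 1, "description"));
    }
}
```

Note that the description still comes back as one string with the <p> markup in it, because everything inside the CDATA section is plain text to the parser.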

score 0 · Accepted Answer

Just in case anyone is still left wondering how I managed to solve the CDATA puzzle:

The logic is as follows:

Once the program extracts the XML and shows the same node tree as the RSS feed, any markup wrapped in a CDATA section is still just text as far as the parser is concerned. The way I found to reach the elements inside it is to create a new XML document from the text content of the CDATA section. Once you parse that new document, you should be able to access all the data you need.
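
The logic above can be sketched roughly like this. It is only a sketch under a few assumptions: the CDATA text happens to be well-formed XML (real feed HTML often is not, and would then need an HTML parser instead), and because a description usually holds several top-level tags, the text is wrapped in a made-up <wrapper> root before parsing. The class name CdataReparseSketch and the method name paragraphs are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CdataReparseSketch {

    // Wraps the CDATA text in a made-up <wrapper> root, parses it as a new
    // document, and returns the trimmed text of every <p> element found.
    static List<String> paragraphs(String cdataText) throws Exception {
        String wrapped = "<wrapper>" + cdataText + "</wrapper>";
        Document fragment = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(wrapped)));

        NodeList pNodes = fragment.getElementsByTagName("p");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < pNodes.getLength(); i++) {
            result.add(pNodes.item(i).getTextContent().trim());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for what getTextContent() would return on a <description> element.
        String cdataText = "<br />.\n <p>First paragraph.</p>\n <p>Second paragraph.</p>";
        for (String paragraph : paragraphs(cdataText)) {
            System.out.println(paragraph);
        }
    }
}
```

In the real program, the string passed to paragraphs would come from calling getTextContent() on the description element rather than from a literal.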
