java - How do I extract XHTML from an ATOM feed using Java?

Question

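I'm trying to extract some XHTML from an RSS feed so that I can place it in a WebView. The RSS feed in question has a tag called <content>, and the characters inside it are XHTML. (The site I'm parsing is a Blogger feed.) What is the best way to extract this content? The < characters confuse my parser. I have tried both DOM and SAX, but neither handles this very well.

Here is a sample of the XML, as requested. In this case, I basically want the XHTML inside the content tag as a single string: <content> XHTML </content>

Edit: Following ignyhere's suggestion, I tried XPath, but I'm still having the same issue. Here is a pastebin sample of my tests.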

score 3 · Accepted Answer

I would try attacking it with XPath. Would something like this work?

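import java.io.InputStream;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

public static String parseAtom (InputStream atomIS) 
   throws Exception { 

   // Below should yield the second content block.
   // The inner literal uses single quotes so it does not end the Java string.
   String xpathString = "(//*[starts-with(name(),'content')])[2]";
   // or, String xpathString = "//*[name() = 'content'][2]";
   // remove the '[2]' to get all content tags or get the count, 
   // if needed, and then target specific blocks 
   //String xpathString = "count(//*[starts-with(name(),'content')])"; 
   // note: the evaluate call below returns a single String, not a node set.
   // XPathConstants.STRING yields the node's text content only; child
   // element markup is not included in the result.

   XPathFactory xpf              = XPathFactory.newInstance (); 
   XPath xpath                   = xpf.newXPath (); 
   XPathExpression xpathCompiled = xpath.compile (xpathString); 

   // use the first to recast and evaluate as NodeList 
   //Object atomOut = xpathCompiled.evaluate ( 
   //   new InputSource (atomIS), XPathConstants.NODESET); 
   // evaluate (InputSource, QName) returns Object, so a cast is required
   String atomOut = (String) xpathCompiled.evaluate ( 
      new InputSource (atomIS), XPathConstants.STRING); 

   System.out.println (atomOut); 

   return atomOut; 

}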
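Since XPathConstants.STRING returns only the text of the matched node, the XHTML tags are lost, which may be why the XPath attempt showed "the same issue". A minimal sketch of one way around that, assuming the feed fits in memory: parse it into a DOM, select the <content> node, and serialize its children back to markup with an identity Transformer. The class and method names (AtomContentExtractor, extractFirstContent) are made up for illustration and do not come from the answer above.

```java
import java.io.InputStream;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of the original answer.
class AtomContentExtractor {

    /** Returns the markup inside the first <content> element, or "" if there is none. */
    static String extractFirstContent(InputStream atomIS) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder().parse(atomIS);

            // local-name() ignores the Atom namespace, so no NamespaceContext is needed.
            XPath xpath = XPathFactory.newInstance().newXPath();
            Node content = (Node) xpath.evaluate(
                    "(//*[local-name()='content'])[1]", doc, XPathConstants.NODE);
            if (content == null) {
                return "";
            }

            // An identity transformer copies DOM nodes to text unchanged.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

            // Serialize each child so the <content> wrapper itself is left out.
            StringWriter out = new StringWriter();
            NodeList children = content.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                serializer.transform(new DOMSource(children.item(i)), new StreamResult(out));
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not extract <content>", e);
        }
    }
}
```

The returned string can then be handed to the WebView. Change the `[1]` in the expression to target a different entry.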

score 3 · Accepted Answer

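It isn't pretty, but this is (the essence of) what I use to parse an ATOM feed from Blogger, using XmlPullParser. The code is pretty icky, but it comes from a real app. You can probably get the general flavor of it anyway.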

    final String TAG_FEED = "feed";

public int parseXml(Reader reader) {
    XmlPullParserFactory factory = null;
    StringBuilder out = new StringBuilder();
    int entries = 0;

    try {
        factory = XmlPullParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XmlPullParser xpp = factory.newPullParser();
        xpp.setInput(reader);

        while (true) {
            int eventType = xpp.next();
            if (eventType == XmlPullParser.END_DOCUMENT) {
                break;
            } else if (eventType == XmlPullParser.START_DOCUMENT) {
                out.append("Start document\n");
            } else if (eventType == XmlPullParser.START_TAG) {
                String tag = xpp.getName();
                // out.append("Start tag " + tag + "\n");
                if (TAG_FEED.equalsIgnoreCase(tag)) {
                    entries = parseFeed(xpp);
                }
            } else if (eventType == XmlPullParser.END_TAG) {
                // out.append("End tag " + xpp.getName() + "\n");
            } else if (eventType == XmlPullParser.TEXT) {
                // out.append("Text " + xpp.getText() + "\n");
            }
        }
        out.append("End document\n");

    } catch (XmlPullParserException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
    //        return out.toString();
    return entries;

}

private int parseFeed(XmlPullParser xpp) throws XmlPullParserException, IOException {
    int depth = xpp.getDepth();
    assert (depth == 1);
    int eventType;
    int entries = 0;
    xpp.require(XmlPullParser.START_TAG, null, TAG_FEED);
    while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT) && (xpp.getDepth() > depth)) {
        // loop invariant: At this point, the parser is not sitting on
        // end-of-document, and is at a level deeper than where it started.

        if (eventType == XmlPullParser.START_TAG) {
            String tag = xpp.getName();
            // Log.d("parseFeed", "Start tag: " + tag);    // Uncomment to debug
            if (FeedEntry.TAG_ENTRY.equalsIgnoreCase(tag)) {
                FeedEntry feedEntry = new FeedEntry(xpp);
                feedEntry.persist(this);
                entries++;
                // Log.d("FeedEntry", feedEntry.title);    // Uncomment to debug
                // xpp.require(XmlPullParser.END_TAG, null, tag);
            }
        }
    }
    assert (depth == 1);
    return entries;
}

class FeedEntry {
    String id;
    String published;
    String updated;
    // Timestamp lastRead;
    String title;
    String subtitle;
    String authorName;
    int contentType;
    String content;
    String preview;
    String origLink;
    String thumbnailUri;
    // Media media;

    static final String TAG_ENTRY = "entry";
    static final String TAG_ENTRY_ID = "id";
    static final String TAG_TITLE = "title";
    static final String TAG_SUBTITLE = "subtitle";
    static final String TAG_UPDATED = "updated";
    static final String TAG_PUBLISHED = "published";
    static final String TAG_AUTHOR = "author";
    static final String TAG_CONTENT = "content";
    static final String TAG_TYPE = "type";
    static final String TAG_ORIG_LINK = "origLink";
    static final String TAG_THUMBNAIL = "thumbnail";
    static final String ATTRIBUTE_URL = "url";

    /**
     * Create a FeedEntry by pulling its bits out of an XML Pull Parser. Side effect: Advances
     * XmlPullParser.
     *
     * @param xpp
     */
    public FeedEntry(XmlPullParser xpp) {
        int eventType;
        int depth = xpp.getDepth();
        assert (depth == 2);
        try {
            xpp.require(XmlPullParser.START_TAG, null, TAG_ENTRY);
            while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT)
                    && (xpp.getDepth() > depth)) {

                if (eventType == XmlPullParser.START_TAG) {
                    String tag = xpp.getName();
                    if (TAG_ENTRY_ID.equalsIgnoreCase(tag)) {
                        id = Util.XmlPullTag(xpp, TAG_ENTRY_ID);
                    } else if (TAG_TITLE.equalsIgnoreCase(tag)) {
                        title = Util.XmlPullTag(xpp, TAG_TITLE);
                    } else if (TAG_SUBTITLE.equalsIgnoreCase(tag)) {
                        subtitle = Util.XmlPullTag(xpp, TAG_SUBTITLE);
                    } else if (TAG_UPDATED.equalsIgnoreCase(tag)) {
                        updated = Util.XmlPullTag(xpp, TAG_UPDATED);
                    } else if (TAG_PUBLISHED.equalsIgnoreCase(tag)) {
                        published = Util.XmlPullTag(xpp, TAG_PUBLISHED);
                    } else if (TAG_CONTENT.equalsIgnoreCase(tag)) {
                        int attributeCount = xpp.getAttributeCount();
                        for (int i = 0; i < attributeCount; i++) {
                            String attributeName = xpp.getAttributeName(i);
                            if (attributeName.equalsIgnoreCase(TAG_TYPE)) {
                                String attributeValue = xpp.getAttributeValue(i);
                                if (attributeValue
                                        .equalsIgnoreCase(FeedReaderContract.FeedEntry.ATTRIBUTE_NAME_HTML)) {
                                    contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_HTML;
                                } else if (attributeValue
                                        .equalsIgnoreCase(FeedReaderContract.FeedEntry.ATTRIBUTE_NAME_XHTML)) {
                                    contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_XHTML;
                                } else {
                                    contentType = FeedReaderContract.FeedEntry.CONTENT_TYPE_TEXT;
                                }
                                break;
                            }
                        }
                        content = Util.XmlPullTag(xpp, TAG_CONTENT);
                        extractPreview();
                    } else if (TAG_AUTHOR.equalsIgnoreCase(tag)) {
                        // Skip author for now -- it is complicated
                        int authorDepth = xpp.getDepth();
                        assert (authorDepth == 3);
                        xpp.require(XmlPullParser.START_TAG, null, TAG_AUTHOR);
                        while (((eventType = xpp.next()) != XmlPullParser.END_DOCUMENT)
                                && (xpp.getDepth() > authorDepth)) {
                        }
                        assert (xpp.getDepth() == 3);
                        xpp.require(XmlPullParser.END_TAG, null, TAG_AUTHOR);

                    } else if (TAG_ORIG_LINK.equalsIgnoreCase(tag)) {
                        origLink = Util.XmlPullTag(xpp, TAG_ORIG_LINK);
                    } else if (TAG_THUMBNAIL.equalsIgnoreCase(tag)) {
                        thumbnailUri = Util.XmlPullAttribute(xpp, tag, null, ATTRIBUTE_URL);
                    } else {
                        @SuppressWarnings("unused")
                        String throwAway = Util.XmlPullTag(xpp, tag);
                    }
                }
            } // while
        } catch (XmlPullParserException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        assert (xpp.getDepth() == 2);
    }
}

public static String XmlPullTag(XmlPullParser xpp, String tag) 
    throws XmlPullParserException, IOException {
    xpp.require(XmlPullParser.START_TAG, null, tag);
    String itemText = xpp.nextText();
    if (xpp.getEventType() != XmlPullParser.END_TAG) {
        xpp.nextTag();
    }
    xpp.require(XmlPullParser.END_TAG, null, tag);
    return itemText;
}

public static String XmlPullAttribute(XmlPullParser xpp, 
    String tag, String namespace, String name)
    throws XmlPullParserException, IOException {
    assert (!TextUtils.isEmpty(tag));
    assert (!TextUtils.isEmpty(name));
    xpp.require(XmlPullParser.START_TAG, null, tag);
    String itemText = xpp.getAttributeValue(namespace, name);
    if (xpp.getEventType() != XmlPullParser.END_TAG) {
        xpp.nextTag();
    }
    xpp.require(XmlPullParser.END_TAG, null, tag);
    return itemText;
}

I'll give you a hint: None of the return values matter. The data is saved into a database by a method (not shown) called at this line:

feedEntry.persist(this);
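One caveat with the pull-parser code above: XmlPullParser.nextText() only succeeds when the element holds plain text, so it suits content whose markup is escaped (type="html"). When <content> contains real child elements (type="xhtml"), nextText() throws an XmlPullParserException. A rough sketch of how nested markup could be collected instead is shown below. Note that it swaps in the JDK's StAX API (javax.xml.stream) rather than XmlPullParser so that it runs on a plain JVM, and the names StaxContentReader, readFirstContent and copyChildren are hypothetical.

```java
import java.io.Reader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper using StAX, not the XmlPullParser API from the answer.
class StaxContentReader {

    /** Returns the markup inside the first <content> element, or "" if there is none. */
    static String readFirstContent(Reader reader) {
        try {
            XMLInputFactory inFactory = XMLInputFactory.newInstance();
            inFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
            XMLEventReader events = inFactory.createXMLEventReader(reader);
            while (events.hasNext()) {
                XMLEvent event = events.nextEvent();
                if (event.isStartElement()
                        && "content".equals(event.asStartElement().getName().getLocalPart())) {
                    return copyChildren(events);
                }
            }
            return "";
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse feed", e);
        }
    }

    /** Copies every event up to (not including) the matching end tag into a string. */
    private static String copyChildren(XMLEventReader events) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        int depth = 0; // nesting level below <content>
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent();
            if (event.isStartElement()) {
                depth++;
            } else if (event.isEndElement()) {
                if (depth == 0) {
                    break; // this is </content> itself
                }
                depth--;
            }
            writer.add(event);
        }
        writer.close(); // flushes any pending output into the StringWriter
        return out.toString();
    }
}
```

The depth counter plays the same role as the getDepth() comparisons in parseFeed: it tells the loop when it has climbed back out of the element it started in.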

score 1 · Accepted Answer

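I can see your problem here. These parsers don't produce the correct result because the contents of your <content> tag are not wrapped in <![CDATA[ ]]>. Until I find a more suitable solution, I would use a quick and dirty trick: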

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

private void parseFile(String fileName) throws IOException {
        StringBuilder sb = new StringBuilder();
        boolean match = false;

        // try-with-resources closes the reader even if reading fails
        try (BufferedReader br = new BufferedReader(new FileReader(new File(fileName)))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.contains("<content")) {
                    match = true;   // entering a <content> element
                }

                // keep every line from the opening tag through the closing tag
                if (match) {
                    sb.append(line);
                    sb.append("\n");
                }

                if (line.contains("</content")) {
                    match = false;  // left the <content> element
                }
            }
        }

        System.out.println(sb.toString());
    }

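This gives you all of the content in a String. You can optionally separate the blocks by slightly modifying this method, or, if you don't need the actual <content> tags, you can filter them out as well.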
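As a rough illustration of the "separate them and drop the <content> tags" variation, here is a sketch that works on the whole feed as one String and uses a regular expression instead of line scanning. It is the same kind of quick and dirty trick, not a real XML parser: it assumes <content> elements are never nested or self-closing. The class name ContentBlocks and the method extract are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
class ContentBlocks {

    // DOTALL lets '.' span line breaks; the reluctant '.*?' stops at the
    // first closing tag, so each <content> element is matched separately.
    private static final Pattern CONTENT =
            Pattern.compile("<content\\b[^>]*>(.*?)</content\\s*>", Pattern.DOTALL);

    /** Returns the inner markup of every <content> element, in document order. */
    static List<String> extract(String feedXml) {
        List<String> blocks = new ArrayList<>();
        Matcher m = CONTENT.matcher(feedXml);
        while (m.find()) {
            blocks.add(m.group(1)); // group 1 is the text between the tags
        }
        return blocks;
    }
}
```

Each list element is one entry's markup without the surrounding <content> tags, so it can be passed along one block at a time.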
