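So obviously what I needed to do was figure out how to use XPath to get data out of the XML output.
Basically, the idea of XPath is that you can reach any node in an XML document. In my case, as the image above shows, I wanted some very specific pieces of information.
This is the XPath for the article links: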
public static final String XPATH_ARTICLE_LINKS =
    "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/@href";
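where //div[@class='landing-slide'] means I am looking for any div whose class is landing-slide, regardless of where it sits in the document (that is what the '//' declares). From there I just step further down the element hierarchy and finally take the value of the href attribute (attributes are addressed with the '@' character).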
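To see the '//' and '@' parts in isolation, here is a minimal sketch that runs a shortened version of the expression against a tiny, made-up document. Note the assumptions: it uses the JDK's built-in javax.xml.xpath engine rather than HtmlCleaner (whose XPath support may differ), the sample markup is hypothetical and well-formed, and the class and method names are mine.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSketch {
    // Hypothetical, heavily simplified markup; the real page is much larger.
    static final String SAMPLE =
        "<html><body>"
        + "<div class='landing-slide'><div class='landing-slide-inner'>"
        + "<a href='/news/one.html'>One</a></div></div>"
        + "<div class='other'><a href='/ignored.html'>Skip</a></div>"
        + "</body></html>";

    // Evaluates an XPath expression that selects attribute nodes and returns their values.
    static List<String> attributeValues(String xml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expression, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                values.add(nodes.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // '//' finds the div anywhere in the document; '@href' selects the attribute.
        String expression =
            "//div[@class='landing-slide']/div[@class='landing-slide-inner']/a/@href";
        System.out.println(attributeValues(SAMPLE, expression)); // prints [/news/one.html]
    }
}
```

The link inside the div with class 'other' is not returned, because the first step only matches divs whose class attribute equals landing-slide.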
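Now that we have the XPath, we just need to pass it to HtmlCleaner. I am doing this through an AsyncTask. Bear in mind that this is not final code, but it definitely gets the information I want.
First, the XPaths used: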
private class News {
    static final String XPATH_ARTICLE_LINKS =
        "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/@href";
    static final String XPATH_ARTICLE_IMAGES =
        "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='fpss-img_holder_div_landing']/div[@id='fpss-img-div_466']/a/img/@src";
    static final String XPATH_ARTICLE_HEADERS =
        "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='landing-fpss-introtext']/div[@class='landing-slidetext']/h1/a";
    static final String XPATH_ARTICLE_DESCRIPTIONS =
        "//div[@class='landing-slide']/div[@class='landing-slide-inner']/div[@class='landing-fpss-introtext']/div[@class='landing-slidetext']/p";
}
Now for the AsyncTask:
private class CleanUrlTask extends AsyncTask<Void, Void, Void> {
    @Override
    protected Void doInBackground(Void... params) {
        try {
            // Try cleaning the NASA page (root node).
            mNode = mCleaner.clean(mUrl);
            // Get all of the article links.
            Object[] mArticles = mNode.evaluateXPath(News.XPATH_ARTICLE_LINKS);
            // Get all of the image links.
            Object[] mImages = mNode.evaluateXPath(News.XPATH_ARTICLE_IMAGES);
            // Get all of the article titles.
            Object[] mTitles = mNode.evaluateXPath(News.XPATH_ARTICLE_HEADERS);
            // Get all of the article descriptions.
            Object[] mDescriptions = mNode.evaluateXPath(News.XPATH_ARTICLE_DESCRIPTIONS);
            Constants.logMessage("Found : " + mArticles.length + " articles");
            // Value containers.
            String link, image, title, description;
            for (int i = 0; i < mArticles.length; i++) {
                // The NASA page often returns links that are not fully qualified URLs,
                // so I prepend the prefix when needed.
                link = mArticles[i].toString().startsWith(FULL_HTML_PREFIX) ? mArticles[i].toString() : NASA_PREFIX + mArticles[i].toString();
                image = mImages[i].toString().startsWith(FULL_HTML_PREFIX) ? mImages[i].toString() : NASA_PREFIX + mImages[i].toString();
                // For the previous two items we were getting an attribute value.
                // Here we need the text inside the element itself, so we cast the object to a TagNode,
                // which lets us extract the text of the matched element.
                title = ((TagNode) mTitles[i]).getText().toString();
                description = ((TagNode) mDescriptions[i]).getText().toString();
                // Only log the values for now.
                Constants.logMessage("Link to article is " + link);
                Constants.logMessage("Image from article is " + image);
                Constants.logMessage("Title of article is " + title);
                Constants.logMessage("Description of article is " + description);
            }
        } catch (Exception e) {
            Constants.logMessage("Error cleaning file: " + e.toString());
        }
        return null;
    }
}
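The prefix check inside the loop can be pulled out into a small helper, which also makes it easy to try without a device. The sketch below is only an illustration: the values of FULL_HTML_PREFIX and NASA_PREFIX are my assumptions (the post never shows them), and the qualify and sharedLength names are hypothetical. The second helper covers the case where the four XPath queries return arrays of different lengths, which the loop above does not check.

```java
class LinkSketch {
    // Assumed values; the original constants are not shown in the post.
    static final String FULL_HTML_PREFIX = "http";
    static final String NASA_PREFIX = "http://www.nasa.gov";

    // Returns the link unchanged if it is already absolute, otherwise prepends the site prefix.
    static String qualify(String raw) {
        return raw.startsWith(FULL_HTML_PREFIX) ? raw : NASA_PREFIX + raw;
    }

    // Smallest length among the result arrays, so one index is valid for all of them.
    static int sharedLength(Object[]... results) {
        int shortest = Integer.MAX_VALUE;
        for (Object[] result : results) {
            shortest = Math.min(shortest, result.length);
        }
        return results.length == 0 ? 0 : shortest;
    }

    public static void main(String[] args) {
        System.out.println(qualify("/news/one.html"));
        System.out.println(qualify("http://example.com/two.html"));
        System.out.println(sharedLength(new Object[3], new Object[2]));
    }
}
```

With something like this, the loop could run up to sharedLength(mArticles, mImages, mTitles, mDescriptions) instead of mArticles.length, so a slide without an image or description would not throw an ArrayIndexOutOfBoundsException.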
In case anyone else is as lost as I was, I hope this points you in the right direction.