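This is my first post!
My problem is that I am using XPath and Tag-Soup to parse web pages and read data from them. These are news articles, and they sometimes have links embedded in the content; those links are where my program goes wrong.
The XPath I am using is storyPath = "//html:article//html:p//text()";
where the page has the following structure: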
<article ...>
<p>Some text from the story.</p>
<p>More of the story, which proves <a href="">what a great story this is</a>!</p>
<p>More of the story without links!</p>
</article>
My code for evaluating the XPath is:
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);
LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);
    String tmp = n.toString();
    tmp = tmp.replace("[#text:", "");
    tmp = tmp.replace("]", "");
    tmp = tmp.replaceAll("’", "'");
    tmp = tmp.replaceAll("‘", "'");
    tmp = tmp.replaceAll("–", "-");
    tmp = tmp.replaceAll("¬", "");
    tmp = tmp.trim();
    story.add(tmp);
}
this.setStory(story);
...
private void setStory(LinkedList<String> story) {
    String tmp = "";
    for (String p : story) {
        tmp = tmp + p + "\n\n";
    }
    this.story = tmp.trim();
}
This gives me the following output:
Some text from the story.
More of the story, which proves
what a great story this is
!
More of the story without links!
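Does anyone have a way for me to eliminate this error? Am I taking the wrong approach somewhere? (I realize the setStory code may well be the culprit, but I can't see another way.)
Without the tmp.replace() calls, every result appears as [#text: what a great story this is] and so on.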
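For context, the //text() step returns every text node separately, so a paragraph containing a link yields three nodes (the text before the link, the link text, and the text after it), and the [#text: ...] wrapper looks like the string form that Node.toString() produces in this DOM implementation. One possible way around both, sketched here under assumptions, is to select the p elements themselves and read each one with getTextContent(). The sketch uses the JDK's javax.xml.xpath instead of XPathAPI and parses a plain, namespace-free XML string instead of Tag-Soup output; the class name ParagraphText and the method paragraphs are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original program.
class ParagraphText {
    // Returns one string per <p> element, including text inside nested tags such as <a>.
    static List<String> paragraphs(String xml) {
        try {
            // Assumption: the input is well-formed XML without namespaces.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // Select the <p> elements themselves, not their descendant text nodes.
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//article//p", doc, XPathConstants.NODESET);
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // getTextContent() concatenates all descendant text in document order.
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article>"
                + "<p>Some text from the story.</p>"
                + "<p>More of the story, which proves <a href=\"\">what a great story this is</a>!</p>"
                + "<p>More of the story without links!</p>"
                + "</article>";
        // Paragraphs are joined with a blank line, as setStory does.
        System.out.println(String.join("\n\n", paragraphs(xml)));
    }
}
```

With namespaced output such as the html: prefix used in storyPath, the expression would presumably become "//html:article//html:p", with the prefix bound the same way as in the existing code.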
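Edit:
I am still having trouble, though possibly of a different kind. What is killing me here is again a link, but because of the way the BBC lays out its pages, the link sits on a separate line, so the output still shows the same symptom as before (note that the example given above is now fixed). The relevant section of the BBC page is: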
<p> Former Queens Park Rangers trainee Sterling, who
<a href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a>
had not started a senior match for the Reds before this season.
</p>
which appears in my output as:
Former Queens Park Rangers trainee Sterling, who
moved to the Merseyside club in February 2010 aged 15,
had not started a senior match for the Reds before this season.
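The remaining line breaks here most likely come from the literal newlines in the page source, which are carried over into the extracted text unchanged. A minimal sketch of one way to deal with that, assuming it is acceptable to treat any run of whitespace inside a paragraph as a single space: collapse the whitespace after extracting the text. The class name WhitespaceNormalizer and the method normalize are hypothetical.

```java
// Hypothetical helper, not part of the original program.
class WhitespaceNormalizer {
    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space
    // and removes leading and trailing whitespace.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Text of the BBC paragraph as it might come out of the parser, newlines included.
        String raw = " Former Queens Park Rangers trainee Sterling, who\n"
                + "moved to the Merseyside club in February 2010 aged 15,\n"
                + "had not started a senior match for the Reds before this season.\n";
        System.out.println(normalize(raw));
    }
}
```

Applied to each paragraph before it is added to the story list, this should leave the "\n\n" separators added in setStory as the only line breaks in the result.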