java - Problem with XPath and links

Question

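This is my first post!

I am using XPath and TagSoup to parse web pages and read in their data. Because these are news articles, they sometimes have links embedded in the content, and those links are what trips up my program.

The XPath I am using is storyPath = "//html:article//html:p//text()"; where the page has the following structure: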

<article ...>
   <p>Some text from the story.</p>
   <p>More of the story, which proves <a href="">what a great story this is</a>!</p>
   <p>More of the story without links!</p>
</article>

My code for the XPath evaluation is this:

NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
    for (int i=0; i<nL.getLength(); i++) {
        Node n = nL.item(i);

        String tmp = n.toString();
        tmp = tmp.replace("[#text:", "");
        tmp = tmp.replace("]", "");
        tmp = tmp.replaceAll("‚Äô", "'");
        tmp = tmp.replaceAll("‚Äò", "'");
        tmp = tmp.replaceAll("‚Äì", "-");
        tmp = tmp.replaceAll("¬", "");
        tmp = tmp.trim();

        story.add(tmp);
    }

this.setStory(story);
...

private void setStory(LinkedList<String> story) {
    String tmp = "";
    for (String p : story) {
        tmp = tmp + p + "\n\n";
    }

    this.story = tmp.trim();
}

The output this gives me is:

Some text from the story.

More of the story, which proves 

what a great story this is

!

More of the story without links!

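Does anybody have a way for me to get rid of this error? Am I taking the wrong approach somewhere? (I realize I may well be with the setStory code, but I can't see another way.)

Without the tmp.replace() code, all the results appear as [#text: what a great story this is] and so on.

Edit:

I am still having trouble, though possibly of a different kind. What is killing me here is again a link, but the way the BBC lay out their pages, the link sits on a separate line in the source, so it still reads in with the same problem described before (note that the problem with the example given above has been fixed). The section of code on the BBC page is: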

    <p>    Former Queens Park Rangers trainee Sterling, who 

    <a  href="http://news.bbc.co.uk/sport1/hi/football/teams/l/liverpool/8541174.stm" >moved to the Merseyside club in February 2010 aged 15,</a> 

    had not started a senior match for the Reds before this season.
    </p>

This shows up in my output as:

    Former Queens Park Rangers trainee Sterling, who 

    moved to the Merseyside club in February 2010 aged 15, 

         had not started a senior match for the Reds before this season.

score 1 · Accepted Answer

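First find the paragraphs: storyPath = "//html:article//html:p". Then, for each paragraph, pull out all of its text with a second XPath query, concatenate the pieces without newlines between them, and put two newlines only at the end of the paragraph.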
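
A minimal sketch of that two-step approach, under some assumptions that are not in the question: it uses the JDK's own javax.xml.xpath instead of XPathAPI, it parses a small well-formed sample with DocumentBuilder instead of TagSoup, and it drops the html: namespace prefix. The class and method names (ParagraphJoiner, extractStory, parse) are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ParagraphJoiner {

    // Step 1: select the <p> elements. Step 2: for each one, select its
    // descendant text nodes and join them with no separator.
    static String extractStory(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList paragraphs = (NodeList) xpath.evaluate(
                    "//article//p", doc, XPathConstants.NODESET);

            StringBuilder story = new StringBuilder();
            for (int i = 0; i < paragraphs.getLength(); i++) {
                Node p = paragraphs.item(i);
                // Relative query: only text nodes below this paragraph.
                NodeList texts = (NodeList) xpath.evaluate(
                        ".//text()", p, XPathConstants.NODESET);

                StringBuilder paragraph = new StringBuilder();
                for (int j = 0; j < texts.getLength(); j++) {
                    paragraph.append(texts.item(j).getNodeValue());
                }
                // Blank line only between paragraphs, not between text nodes.
                story.append(paragraph.toString().trim()).append("\n\n");
            }
            return story.toString().trim();
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    // Hypothetical helper: parses a well-formed string (stand-in for TagSoup).
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sample", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><article>"
                + "<p>Some text from the story.</p>"
                + "<p>More of the story, which proves "
                + "<a href=\"\">what a great story this is</a>!</p>"
                + "<p>More of the story without links!</p>"
                + "</article></html>";
        System.out.println(extractStory(parse(xml)));
    }
}
```

With the sample from the question, the second paragraph should come out as a single line, because the link text is appended to the same buffer as the text around it.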

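On a different note, you should not have to do replaceAll("‚Äô", "'"). That is a sure sign that you are opening the file incorrectly. When you open the file, you need to pass a Reader to TagSoup. You should initialize the Reader like this: Reader r = new BufferedReader(new InputStreamReader(new FileInputStream("myfilename.html"),"Cp1252")); where you specify the correct character set for the file. A list of character sets is here: http://docs.oracle.com/javase/1.5.0/docs/guide/intl/encoding.doc.html I am guessing that it is Windows Latin 1.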
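
To make the effect of the charset argument concrete, here is a small self-contained sketch. It writes a sample file as UTF-8 (an assumption for the demo; the real page's encoding has to be checked) and reads it back through an InputStreamReader twice, once with the matching charset and once with a single-byte one. The class and helper names (CharsetReaderDemo, readAll, writeSample) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class CharsetReaderDemo {

    // Reads the whole file through a Reader built with an explicit charset,
    // instead of relying on the platform default.
    static String readAll(Path file, Charset charset) {
        try (Reader r = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = r.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes sample text to a temp file in the given charset.
    static Path writeSample(String text, Charset charset) {
        try {
            Path file = Files.createTempFile("story", ".html");
            Files.write(file, text.getBytes(charset));
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // \u2019 is the right single quotation mark, common in news copy.
        Path file = writeSample("<p>It\u2019s a story</p>", StandardCharsets.UTF_8);

        // Matching charset: the quote survives.
        System.out.println(readAll(file, StandardCharsets.UTF_8));
        // Mismatched single-byte charset: the quote turns into several junk characters.
        System.out.println(readAll(file, StandardCharsets.ISO_8859_1));
    }
}
```

In the real program the Reader would be wrapped in an org.xml.sax.InputSource and handed to the parser, so the decoding happens once, at the point where the bytes are read.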

score 1 · Accepted Answer

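The [#text: thing is simply the toString() representation of a DOM Text node. The toString() method is meant for when you need a string representation of a node for debugging. Instead of toString(), use getTextContent(), which returns the actual text.

If you don't want the link content to appear on separate lines, you can remove the //text() from your XPath and take the textContent of the element nodes directly (getTextContent() on an element returns the concatenation of all its descendant text nodes):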

String storyPath = "//html:article//html:p";
NodeList nL = XPathAPI.selectNodeList(doc,storyPath);

LinkedList<String> story = new LinkedList<String>();
for (int i=0; i<nL.getLength(); i++) {
    Node n = nL.item(i);
    story.add(n.getTextContent().trim());
}

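The fact that you have to fix up things like "‚Äô" by hand suggests that your HTML is actually encoded in UTF-8 but you are reading it with a single-byte character set (for example Windows1252). Rather than trying to repair it after the fact, work out how to read the data in the correct encoding in the first place.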
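
A short sketch of how that kind of garbage arises: one character is encoded to UTF-8 bytes and then decoded with a single-byte charset. The code uses windows-1252 as the wrong charset purely as an example; the exact junk characters depend on which charset is used (the "‚Äô" sequence in the question looks consistent with Mac OS Roman, though that is a guess). The class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Encodes text as UTF-8, then decodes the same bytes with another charset.
    static String decodeUtf8BytesAs(String text, Charset charset) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, charset);
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // right single quotation mark, 3 bytes in UTF-8
        Charset cp1252 = Charset.forName("windows-1252"); // example wrong charset

        // One character becomes three, one per UTF-8 byte.
        System.out.println(decodeUtf8BytesAs(quote, cp1252).length());
        // Decoding with the right charset gives back the single character.
        System.out.println(decodeUtf8BytesAs(quote, StandardCharsets.UTF_8).length());
    }
}
```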

score 1 · Accepted Answer

For the problem in your edit, where newlines from the HTML source end up in your text, you need to remove them before printing. Instead of System.out.print(text.trim()); do System.out.println(text.trim().replaceAll("[ \t\r\n]+", " "));
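
As a quick check of that regex, here is a sketch applied to text shaped like the BBC paragraph from the edit. The class name and the normalize helper are hypothetical; only the trim() and replaceAll call come from the answer.

```java
public class WhitespaceDemo {

    // Trims the ends, then collapses every run of spaces, tabs and
    // line breaks into a single space.
    static String normalize(String text) {
        return text.trim().replaceAll("[ \t\r\n]+", " ");
    }

    public static void main(String[] args) {
        String paragraph =
                "    Former Queens Park Rangers trainee Sterling, who \n\n"
              + "    moved to the Merseyside club in February 2010 aged 15, \n\n"
              + "    had not started a senior match for the Reds before this season.\n    ";
        // The whole paragraph is printed on one line.
        System.out.println(normalize(paragraph));
    }
}
```

Applying this per paragraph, before the "\n\n" separator is added, keeps the blank lines between paragraphs while removing the line breaks inside them.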
