java - Why doesn't Javax's XPath evaluate() method return an element containing a non-breaking space when the selector uses the text() node test

Question

I have the following Java code:

    @Test
    public void notGettingNonBreakingSpace() throws ParserConfigurationException, IOException, SAXException, XPathExpressionException {
        DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
        documentBuilderFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

        DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

        String html = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" \n" +
            "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n" +
            "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n" +
            "<body><table><tr><td>&nbsp;</td></tr></table></body>\n" +
            "</html>";

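        // Encode explicitly as UTF-8 to match the encoding named in the XML declaration
        Document document = documentBuilder.parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));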
        XPath xpath = XPathFactory.newInstance().newXPath();
        int result = ((NodeList) xpath.evaluate("//tr/td/text()", document, XPathConstants.NODESET)).getLength();
        assertEquals(1, result);
    }

The assertion fails because result is 0. However, if I save the HTML as an .htm file and open it in Chrome, $x("//tr/td/text()") in the developer tools console returns what I expect:

[text]
> 0: text
  length: 1
> __proto__: Array(0)

What do I need to do to get the same result in Java, i.e. a node list containing one item?

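Is there an "ignore whitespace" setting on the DocumentBuilder or XPath objects, or is the root cause that Java and Chrome's JS engine disagree on how to treat that particular whitespace character?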

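Note: removing text() (i.e. the text node selection) works; the expression then returns the correct result. Replacing the non-breaking space with actual text (e.g. foo) also works.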

score 1 · Accepted Answer

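It looks like Java does not recognize &nbsp; when DTD loading is disabled. The entity is declared in the external XHTML DTD, so when that DTD is skipped the reference is apparently dropped and the td element ends up with no text node.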

Your problem can be solved by declaring the &nbsp; entity yourself in the HTML, for example:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" [ <!ENTITY nbsp "&#160;"> ]>

The evaluation now yields one text node.
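
To make the difference concrete, here is a minimal sketch that runs the question's XPath expression against both variants of the document. The class name NbspEntityDemo and the helpers xhtml and countTdTextNodes are made up for illustration. The sketch assumes the JDK's bundled parser, which accepts the load-external-dtd feature used in the question, and it declares the entity as &#160;, the numeric character reference for a non-breaking space.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical demo class; the name and both helpers are illustrative only.
class NbspEntityDemo {

    // Builds the XHTML document from the question. The internalSubset argument
    // is spliced into the DOCTYPE declaration (pass "" for the original document).
    static String xhtml(String internalSubset) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"" + internalSubset + ">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "<body><table><tr><td>&nbsp;</td></tr></table></body>\n"
            + "</html>";
    }

    // Parses the document with external DTD loading disabled, as in the question,
    // and returns how many nodes //tr/td/text() selects.
    static int countTdTextNodes(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document document = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate("//tr/td/text()", document, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            // Wrapped so callers do not have to handle the checked parser and XPath exceptions.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Original document: nbsp is only declared in the external DTD, which is not loaded.
        System.out.println(countTdTextNodes(xhtml("")));
        // Same document with nbsp declared in the internal DTD subset.
        System.out.println(countTdTextNodes(xhtml(" [ <!ENTITY nbsp \"&#160;\"> ]")));
    }
}
```

The first call corresponds to the failing test in the question, and the second to the DOCTYPE shown above. Declaring the entity in the internal subset keeps the external DTD out of the picture, so no network access is needed.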
