xml - How to extract an XPath from an HTML page using the BaseX command line

Question

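I want to extract the XPath //DIV[@id="ps-content"] from this web page: http://www.amazon.com/dp/1449319432 (saved as a local file).

I want to do it with a single command line and one of the better parsers, such as BaseX or Saxon-PE.

The shortest solution I have (seemingly) found so far is these two lines: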

```
java -jar tagsoup-1.2.1.jar <page.html >page.xhtml
basex -ipage.xhtml "//DIV[@id='ps-content']"
```

But it returns only an empty line instead of the block of HTML code I expected.

My question is twofold:

What is wrong with my command lines? Why don't they return the block of HTML code that my XPath selects?
Since BaseX has embedded TagSoup support (see http://docs.basex.org/wiki/Parsers#HTML_Parser), how can I combine my two lines into one?

score 1 · Accepted Answer

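There are two problems with your query:

TagSoup adds namespaces

Either register the namespace (declaring a default element namespace seems reasonable, since you are probably dealing only with XHTML):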
```
basex -ipage.xhtml "declare default element namespace 'http://www.w3.org/1999/xhtml'; //div[@id='ps-content']"
```
Or use * as a namespace wildcard for each element:
```
basex -ipage.xhtml "//*:div[@id='ps-content']"
```
XML/XQuery is case sensitive

I have already corrected this in the queries under the first problem: <div/> is not the same as <DIV/>. Both of those queries already yield the expected result.
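
Both problems can be reproduced outside BaseX. The following is only an illustrative sketch using Python's standard `xml.etree.ElementTree` (not part of the original setup), with a small hand-written XHTML string standing in for TagSoup's output. It assumes Python 3.8 or later for the `{*}` wildcard, which plays a role similar to `*:` in XQuery.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for TagSoup output: lowercase elements in the XHTML namespace.
XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div id="ps-content"><p>Product description</p></div>
  </body>
</html>"""

root = ET.fromstring(XHTML)
ns = {"x": "http://www.w3.org/1999/xhtml"}

# Uppercase name and no namespace: matches nothing (the problem in the question).
no_match_upper = root.findall(".//DIV[@id='ps-content']")

# Lowercase name but still no namespace: also matches nothing.
no_match_plain = root.findall(".//div[@id='ps-content']")

# Lowercase name bound to the XHTML namespace: matches the element.
match = root.findall(".//x:div[@id='ps-content']", ns)

# Namespace wildcard (Python 3.8+), comparable to //*:div in XQuery.
wildcard = root.findall(".//{*}div[@id='ps-content']")

print(len(no_match_upper), len(no_match_plain), len(match), len(wildcard))
```

Only the last two lookups find the element, which mirrors why the corrected BaseX queries need both the lowercase name and a namespace declaration or wildcard.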

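TagSoup can be used from within BaseX, so you do not have to call it separately for HTML input. Make sure TagSoup is on your default Java classpath, e.g. by installing libtagsoup-java on Debian.

```
basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'
```

If you like, you can even query the HTML page directly from BaseX:

```
basex 'declare option db:parser "html"; doc("http://www.amazon.com/dp/1449319432")//*:div[@id="ps-content"]'
```

Using -i with TagSoup did not work for me, but you can use doc(...) instead.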

score 0 · Accepted Answer

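I finally found the right command line:

```
basex "declare option db:parser 'html'; doc('page.html')//*:div[@id='ps-content']"
```

Note: swapping the quote types like this does not work on my Windows 7 machine, presumably because cmd.exe does not treat single quotes as quoting characters:

```
basex 'declare option db:parser "html"; doc("page.html")//*:div[@id="ps-content"]'
```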
