xml - XPath query performance on large documents

Question

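For a 5 MB document, libxml2 takes 3 seconds to evaluate the following query. Is there anything I can do to speed it up? I need the resulting node set for further processing, so no count etc.

Thanks!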

descendant::text() | descendant::*
[
self::p or
self::h1 or
self::h2 or
self::h3 or
self::h4 or
self::h5 or
self::h6 or
self::dl or
self::dt or
self::dd or
self::ol or
self::ul or
self::li or
self::dir or
self::address or
self::blockquote or
self::center or
self::del or
self::div or
self::hr or
self::ins or
self::pre
]

Edit:

Using descendant::node()[self::text() or self::p or ... as suggested by Jens Erat (see the accepted answer) improved the speed dramatically: from the original 2.865330s down to a just-perfect 0.164336s.

score 3 · Accepted Answer

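Benchmarking without any document to benchmark against is very difficult.

Two ideas for optimization:

Use as few descendant:: axis steps as possible. They are expensive, so reducing them may speed things up a bit. You can combine the text() test and the element tests like this: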
```
descendant::node()[self::text() or self::h1 or self::h2]
```
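and extend it to cover all the elements (I kept the query short for readability).
Use string tests instead of node tests. They might be faster (probably not, see the comments on this answer). Of course, you still need to keep the text() test.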
```
descendant::node()[self::text() or local-name(.) = 'h1' or local-name(.) = 'h2']
```
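Writing out every element name by hand is tedious, so either form of the expression can be generated from a list. This is a minimal sketch in Python, assuming the element names from the question; `build_query` is a hypothetical helper name, and it only builds the XPath string, it does not evaluate it.

```python
# Block-level element names taken from the query in the question.
BLOCK_TAGS = [
    "p", "h1", "h2", "h3", "h4", "h5", "h6", "dl", "dt", "dd", "ol", "ul",
    "li", "dir", "address", "blockquote", "center", "del", "div", "hr",
    "ins", "pre",
]


def build_query(tags, use_local_name=False):
    """Build a single-step XPath 1.0 expression for text nodes plus the given elements.

    build_query is a hypothetical helper, not part of any library.
    """
    if use_local_name:
        # String tests: compare the local name of each node.
        tests = ["local-name(.) = '%s'" % tag for tag in tags]
    else:
        # Node tests on the self axis.
        tests = ["self::%s" % tag for tag in tags]
    return "descendant::node()[self::text() or %s]" % " or ".join(tests)


if __name__ == "__main__":
    print(build_query(BLOCK_TAGS))
    print(build_query(["h1", "h2"], use_local_name=True))
```

The resulting string can be passed to whatever XPath 1.0 evaluator is in use.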

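If you query the same document often, consider using a native XML database such as BaseX, eXist DB, Zorba, MarkLogic ... (the first three are free). They build indexes on your data and should be able to deliver results faster (and they support XPath 2.0/XQuery, which makes development easier). All of them have APIs for a large number of programming languages.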

score 0 · Answer

Did you compile libxml2 with the --with-threads option enabled? If so, the most direct approach is to solve the problem with a faster processor.

score 0 · Answer

Your query is equivalent to

(descendant::text() | descendant::p
    | descendant::h1  | descendant::h2  | descendant::h3 | descendant::h4  | descendant::h5 | descendant::h6
    | descendant::dl  | descendant::dt  | descendant::dd | descendant::ol  | descendant::ul | descendant::li
    | descendant::dir | descendant::address | descendant::blockquote | descendant::center
    | descendant::del | descendant::div | descendant::hr | descendant::ins | descendant::pre
)

but I could not measure any difference in speed.
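To check this on your own data, the formulations from this thread can be timed side by side. The following is a rough sketch, assuming Python with the third-party lxml package (which wraps libxml2). The synthetic document, the shortened tag list, and the helper names `make_document` and `timed` are all made up for illustration; timings on a real 5 MB file may look quite different.

```python
import time

from lxml import etree  # third-party; lxml is a binding to libxml2

TAGS = ["p", "h1", "h2", "div", "li"]  # shortened list for readability

# The three formulations discussed in this thread.
ORIGINAL = "descendant::text() | descendant::*[%s]" % " or ".join("self::" + t for t in TAGS)
SINGLE_STEP = "descendant::node()[self::text() or %s]" % " or ".join("self::" + t for t in TAGS)
UNION = "(descendant::text() | %s)" % " | ".join("descendant::" + t for t in TAGS)


def make_document(sections=2000):
    """Build a synthetic HTML-like tree (made-up data, not the asker's file)."""
    root = etree.Element("html")
    body = etree.SubElement(root, "body")
    for i in range(sections):
        div = etree.SubElement(body, "div")
        etree.SubElement(div, "h2").text = "Heading %d" % i
        p = etree.SubElement(div, "p")
        p.text = "Some text "
        etree.SubElement(p, "em").text = "emphasis"
    return root


def timed(root, expression):
    """Evaluate one expression and return (elapsed seconds, resulting nodes)."""
    start = time.perf_counter()
    nodes = root.xpath(expression)
    return time.perf_counter() - start, nodes


if __name__ == "__main__":
    root = make_document()
    for name, query in [("original", ORIGINAL), ("single step", SINGLE_STEP), ("union", UNION)]:
        elapsed, nodes = timed(root, query)
        print("%-12s %6d nodes  %.6fs" % (name, len(nodes), elapsed))
```

All three expressions should return the same number of nodes; only the elapsed time is expected to vary, and a single run is noisy, so repeat the measurement before drawing conclusions.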
