xml - In Marklogic, how can I efficiently deep-compare two xml documents?

Question

I have a logging requirement to store the differences between old and new values when a (moderately complex) section of a document changes in our database. Only the changed data should be reported on. My current solution works reasonably well, but I have concerns that it's not optimal and may cause performance problems when updates start occurring in volume.

My current solution looks mostly like this:

for $element in $data/section//element()[text()]
return
  if (not($old-data//*[fn:name() = fn:name($element) and text() = $element/text()])) then
    element log:difference {
       ...
    }
  else ()

My problem is that the profiler shows this taking a (relatively) long time, because the //*[fn:name() = fn:name($element)] construct leads to thousands of comparisons. It's only a few tens of milliseconds, but with a lot of updates that's going to add up, and it feels like there should be a way to avoid it.

The structure of the xml is sufficiently well defined that I can be sure that a field in one document will have the same relative xpath as the other one, so technically my use of // could be removed, at the expense of manually walking the xml tree, but that's a reasonable amount of complexity and the structure is fairly flat so I'm not sure it would be very much more efficient.

Also, there is a finite set of fields that can be in this section of the document, so manually comparing each of them in turn (with fully qualified xpaths) would be an option. I'd rather avoid it, though, since it would be best not to need to revisit this code in the future should that list of fields change.

Are the solutions going to be along those lines, or is there something more obvious that I've missed?

Is there any way to construct the xpath using the string value of the element name directly without using a predicate? I'm assuming that would be more efficient, since xpath evaluation doesn't normally take as long as this.

Can I, perhaps, extract the relative xpath of an element then look at that precise place in the other document?

Am I missing a built-in xml comparison tool in MarkLogic itself?

score 3 · Accepted Answer

Using fn:name is a bad idea because it can be fooled by differences in namespace prefixes. It would be better to use fn:node-name. I would also avoid '//' wherever possible.
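
To illustrate the prefix problem, here is a minimal sketch. The namespace URI and element names are made up for the example; the two elements are in the same namespace and differ only in the prefix bound to it.

```xquery
xquery version "1.0";

(: Hypothetical data: same namespace URI, different prefixes. :)
let $a := <p:item xmlns:p="http://example.com/ns">1</p:item>
let $b := <q:item xmlns:q="http://example.com/ns">1</q:item>
return (
  (: fn:name returns the lexical name, prefix included: "p:item" vs "q:item" :)
  fn:name($a) = fn:name($b),
  (: fn:node-name returns an xs:QName, compared by namespace URI and local name :)
  fn:node-name($a) = fn:node-name($b)
)
```

The first comparison is false and the second is true, so a predicate built on fn:name would report a difference where the documents are actually equivalent.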

Getting back to the deep compare, this sounds like an XML diff. There is no XML diff tool built into MarkLogic, so it might be best to set one up as a REST-ish web service and use MarkLogic's xdmp:http-post (http://docs.marklogic.com/xdmp:http-post) to call it. There are quite a few XML diff tools out there.

If you want to stay in XQuery, the solution will probably be slower. I would start with a recursive tree-walk and fn:deep-equal. Whenever you find a diff for a simple element you can stop descending, which prunes the tree and limits the work to be done. Here's a very rough sketch of how that might work. It's a long way from a proper LCS-based diff (http://en.wikipedia.org/wiki/Diff), but it might be useful. On my laptop this runs in less than 10 ms.

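(: Rough structural diff: returns one <diff> per leaf-level mismatch.
   Either argument may be empty when one side has fewer children. :)
declare function local:diff(
  $a as node()?, $b as node()?)
as element(diff)*
{
  (: Identical subtrees: nothing to report, stop descending. :)
  if (deep-equal($a, $b)) then ()
  (: At least one side has no child elements: report both values. :)
  else if (empty($a/*) or empty($b/*)) then element diff {
    element a { $a }, element b { $b } }
  else
    (: Compare child elements pairwise by position. :)
    let $seq-a := $a/*
    let $seq-b := $b/*
    let $count := max((count($seq-a), count($seq-b)))
    return
      for $x in 1 to $count
      return local:diff($seq-a[$x], $seq-b[$x])
};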

let $a := xdmp:query-meters()
let $_ := xdmp:sleep(1)
let $b := xdmp:query-meters()
return local:diff($a, $b)

score 1 · Accepted Answer

I would think it's worthwhile to try building an index, and benchmarking that approach.

I'm not well versed in MarkLogic, but they have what I recognize as an XSL key function in their API docs.

(Update: this seems to only fetch keys. To create them, I'd guess you'd need to use XSLT directly. This is a good how-to. A small stylesheet generating keys on element/@id would be feasible.)

You could even pass the stylesheet inline, and save a little I/O time:

xdmp:xslt-eval(
  <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"><xsl:key name="element_ids" match="element" use="@id"/></xsl:stylesheet>,
  doc("input.xml")
)

If every element has an identifier you can use as a key, you can build an index when you parse the file, then compare that list against a stored (earlier) version of keys. From there, you have your list of locations to handle, and thanks to the index, they are found and accessed quite quickly.

If you'd rather stick with XQuery, MarkLogic's map functions (map:map, map:put, map:get) provide a similar interface.
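
As a rough sketch of how a map could replace the repeated // scan from the question: index the old values once, keyed by each element's path, then look up each new element directly. The sample data, the local:key helper, and the shape of the difference element are all assumptions for illustration, and the sketch relies on the question's statement that a field has the same relative path in both documents. Repeated sibling elements would share a key, so it only suits fairly flat structures with unique field names.

```xquery
xquery version "1.0-ml";

(: Hypothetical helper: builds a key from the namespace URI and local name
   of every ancestor-or-self element, so namespace prefixes do not matter. :)
declare function local:key($e as element()) as xs:string
{
  fn:string-join(
    for $step in $e/ancestor-or-self::*
    return fn:concat("{", fn:namespace-uri($step), "}", fn:local-name($step)),
    "/")
};

(: Made-up sample data standing in for the stored and updated sections. :)
let $old-data := <section><name>Alice</name><city>Leeds</city></section>
let $data := <section><name>Alice</name><city>York</city></section>

(: One pass over the old document to index its values by path. :)
let $old := map:map()
let $_ :=
  for $e in $old-data/descendant-or-self::*[text()]
  return map:put($old, local:key($e), fn:string($e))

(: One pass over the new document, with a direct lookup per element. :)
for $e in $data/descendant-or-self::*[text()]
let $previous := map:get($old, local:key($e))
where fn:not($previous = fn:string($e))
return
  element difference {
    attribute path { local:key($e) },
    element old { $previous },
    element new { fn:string($e) }
  }
```

With the sample data only the city field should be reported, since name is unchanged. Fields that exist only in the new document are reported with an empty old value; fields that were removed would need a second pass over the map's keys. Whether this is actually faster than the recursive walk is worth checking in the profiler.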
