xml - Approaches for performing a "Grep -f" style query on XML data?

Question

I saw several great command-line XML manipulation tools in this discussion, and I'm exploring new ways to extract data from XML files through scripting instead of compiled programs. I'm currently trying out xmlstarlet, but I'm not restricted to using this tool.

I have an XML data file that has tens of thousands of elements. I'd like to extract a subset of those elements based on a list of search terms, and then pipe or otherwise route those elements into some downstream scripts and transforms. The search terms are simple strings--there's no need for regular expressions. If I were doing this with grep on a regular text file, I would probably do something simple like:

grep -Ff StringsToSearchFor.txt MassiveFile.txt | [chain of additional commands]

I've been looking through the documentation for tools like xmlstarlet on ways that I could achieve this, and the closest thing I can come up with is this ugly attempt that uses a temporary file. (Note, I am using Windows):

REM Create tempOutput.xml, with an open root node 

REM %1 is the file containing the list of strings
REM %2 is the target XML file
for /F %%A in (%1) do (
   REM Search for a single matching node, and append the output to tempOutput.xml
   xml sel -I -t -c "path/to/search[targetElement='%%A']" %2 >> tempOutput.xml
)

REM Close root node to tempOutput.xml

REM After this stage, pass tempOutput.xml as the input to downstream XML transforms and tools

Needless to say, this is really ugly.

I suppose that one possibility is to modify the for loop to pass a giant list of -c XPath queries to xmlstarlet all in one shot, but that also seems unnecessarily messy, and I think that I would still be stuck with using the tempOutput.xml file.

Is there a more elegant way to do this? Or is a temporary file really my best approach?

score 1 · Accepted Answer

You could write an XSLT stylesheet that takes the target XML as its source document and reads the file containing the list of strings using document(). (With XSLT 2.0 the list doesn't have to be XML, since unparsed-text() can read a plain text file.) It could then parse the list of strings and look for XPath matches in the target XML document for any of the strings:

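<!-- Assumes a top-level <xsl:variable name="source" select="/"/> (the name is illustrative). -->
<!-- Inside the outer loop the context is the string list, so a bare "/" would no longer refer to the target XML. -->
<xsl:for-each select="$strings-to-match">
  <xsl:for-each select="$source/path/to/search[targetElement = current()]">
    <!-- whatever format you need to output these in... -->
    <xsl:value-of select="." />
  </xsl:for-each>
</xsl:for-each>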

This would output the string value (concatenated descendant text nodes) of the elements that match. You could output whatever you want at that point, depending on the needs of downstream programs.
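For comparison, the same selection can be sketched outside XSLT. This is a minimal sketch using only Python's standard library, not part of the stylesheet approach above. The sample XML, the `payload` element, the `matches` wrapper element, and the function names are all made up for illustration; only the path/to/search/targetElement layout comes from the question.

```python
import xml.etree.ElementTree as ET

# Made-up sample data that mirrors the path/to/search/targetElement layout.
SAMPLE_XML = """<path><to>
  <search><targetElement>alpha</targetElement><payload>1</payload></search>
  <search><targetElement>beta</targetElement><payload>2</payload></search>
  <search><targetElement>gamma</targetElement><payload>3</payload></search>
</to></path>"""


def load_terms(lines):
    """Return the non-empty lines as a set of literal search strings."""
    return {line.rstrip("\r\n") for line in lines if line.strip()}


def select_matches(root, terms, record_path="to/search", field="targetElement"):
    """Return records having a `field` child whose string value is in `terms`."""
    return [
        record
        for record in root.iterfind(record_path)
        if any("".join(child.itertext()) in terms for child in record.iterfind(field))
    ]


def wrap(records, wrapper_tag="matches"):
    """Put the matched records under one root so the output is well-formed XML."""
    out = ET.Element(wrapper_tag)
    out.extend(records)
    return out


terms = load_terms(["alpha\n", "\n", "gamma\n"])  # stands in for the strings file
root = ET.fromstring(SAMPLE_XML)                  # stands in for the target XML file
result = wrap(select_matches(root, terms))
print(ET.tostring(result, encoding="unicode"))
```

Because the terms are held in a set, each record costs one lookup per targetElement child instead of one pass over the whole file per search string, and the result goes to STDOUT as a single document, so no temporary file is needed.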

score 0 · Accepted Answer

With my Xidel you can write it like this:

xidel --extract-exclude=search-terms  StringsToSearchFor.txt -e '$search-terms := tokenize($raw, $line-ending)[. != ""]' MassiveFile.txt -e 'path/to/search[targetElement = $search-terms]'

But it might be a little slow for a big file (it used to be fast, even with streaming xml, but I threw all optimizations out when implementing full XQuery; that was complicated enough already).
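If speed on a big file is the main concern, a streaming pass is one alternative. The following is a rough sketch in Python's standard library, not Xidel; the function name and the sample input are hypothetical, and it matches records by tag name only rather than by the full path/to/search path.

```python
import io
import xml.etree.ElementTree as ET


def stream_matches(source, terms, record_tag="search", field="targetElement"):
    """Yield serialized matching records, discarding each record once handled."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        if any("".join(c.itertext()) in terms for c in elem.iterfind(field)):
            elem.tail = None  # keep trailing whitespace out of the fragment
            yield ET.tostring(elem, encoding="unicode")
        elem.clear()  # drop the record's children to keep memory use low


# Made-up input standing in for the large XML file.
demo = io.BytesIO(
    b"<path><to>"
    b"<search><targetElement>alpha</targetElement></search>"
    b"<search><targetElement>beta</targetElement></search>"
    b"</to></path>"
)
streamed = list(stream_matches(demo, {"beta"}))
for fragment in streamed:
    print(fragment)
```

Each fragment is emitted as soon as its closing tag has been parsed, so downstream tools can start consuming output before the whole file has been read.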

score 0 · Accepted Answer

Especially if you're analyzing that file repeatedly, consider trying an XML database. Most of them support indexes for string search, which can speed up searching considerably. You may also find it convenient to do further analysis directly in XQuery.

An XPath 2.0 expression (XPath is a subset of XQuery) to perform your search would be:

/path/to/search[targetElement = ('list', 'of', 'strings', 'to', 'search', 'for')]

Some implementations support XQuery Full Text, which even enhances text search (especially with efficient indices):

/path/to/search[targetElement contains text { 'list', 'of', 'strings' }]

Reading this list of words is easy, but how you do it depends on how the list is stored and which implementation you're using.

BaseX is one of those databases (it is open source software; disclaimer: I'm somewhat affiliated with them). Galax also has XQuery Full Text support; other well-known XML databases and XQuery processors are eXist-db, Saxon, Sedna and MarkLogic. All of them have a command-line tool that prints results to STDOUT, so you can pipe the output into the rest of your processing chain.

All of these queries (including yours) will return the parent element if any child matches that string. You might want to use targetElement/text() to limit the match to elements that directly contain the needle you're looking for.
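The difference between comparing targetElement and comparing targetElement/text() comes down to an element's string value (all descendant text, concatenated) versus its own text nodes. A small sketch in Python's standard library, with a made-up record containing a nested `note` element, shows the two values side by side:

```python
import xml.etree.ElementTree as ET

# Made-up record: targetElement holds its own text plus a nested element.
record = ET.fromstring(
    "<search><targetElement>needle<note>extra</note></targetElement></search>"
)
target = record.find("targetElement")

# String value: what a comparison like targetElement = '...' is made against.
string_value = "".join(target.itertext())

# The element's own leading text node: what targetElement/text() selects here.
direct_text = target.text

print(string_value)
print(direct_text)
```

With this record, a comparison against 'needle' only succeeds on the direct text, because the string value also includes the text of the nested element.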
