Hi everyone.
I'm trying to filter a large XML file (BLAST output) to keep only some <Iteration>
nodes, selected by a list of <Iteration_iter-num>
values that I read from a file. Here is a simplified example (the real Blast.xml has more than 80000 iterations):
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>3037</Iteration_iter-num>
<Iteration_query-ID>Query_3037</Iteration_query-ID>
</Iteration>
<Iteration>
<Iteration_iter-num>5673</Iteration_iter-num>
<Iteration_query-ID>Query_5673</Iteration_query-ID>
</Iteration>
<Iteration>
<Iteration_iter-num>11397</Iteration_iter-num>
<Iteration_query-ID>Query_11397</Iteration_query-ID>
</Iteration>
<Iteration>
<Iteration_iter-num>15739</Iteration_iter-num>
<Iteration_query-ID>Query_15739</Iteration_query-ID>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
I also have a file listing the iterations to keep (saved as keep_iter):
5673
11397
For a small case like this, I managed to do the filtering with xmlstarlet by first creating a file that holds the XPath predicate used for the comparison (saved as filter):
Iteration_iter-num!=5673 and Iteration_iter-num!=11397
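For reference, a minimal sketch of how a filter string like this can be built from the keep list, assuming one iteration number per line. The function name build_filter is only a placeholder chosen for this example; it reads the numbers on standard input and skips blank lines.

```shell
# build_filter: hypothetical helper, reads iteration numbers on stdin
# (one per line) and prints a single XPath predicate joined with " and ".
build_filter() {
  awk 'NF { printf "%sIteration_iter-num!=%s", sep, $1; sep = " and " }
       END { print "" }'
}

# Demo with the two values from the example keep_iter file.
printf '%s\n' 5673 11397 | build_filter
```

With the real file this would be run as `build_filter < keep_iter > filter`.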
This works like a charm with:
xmlstarlet ed -d "/BlastOutput/BlastOutput_iterations/Iteration[$(cat filter)]" Blast.xml > finalBlast.xml
Basically, this deletes every Iteration node whose number is not listed in the filter file, giving:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
<BlastOutput>
<BlastOutput_program>blastx</BlastOutput_program>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>5673</Iteration_iter-num>
<Iteration_query-ID>Query_5673</Iteration_query-ID>
</Iteration>
<Iteration>
<Iteration_iter-num>11397</Iteration_iter-num>
<Iteration_query-ID>Query_11397</Iteration_query-ID>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
The problem is that my real keep_iter file has 20000 values to filter on. When I create the filter file from it and run the xmlstarlet command above, it fails because the argument is too long.
Any suggestions for filtering such a Blast.xml file so that only the Iteration nodes whose iteration number is listed in the keep_iter file (with 20k values) are kept? I want to preserve the original XML structure.