I'm using the BaseX XML database and have a lot of XML data, approximately 50 000 files of various sizes. However, one of the local functions I have implemented is too computationally heavy. Unfortunately, it is crucial to my work.
Let us assume I have 50 000 files, one for every Student, and every Student has an attribute called friend. For each Student, I want to find out how many friends the Student has.
Here is some example code:
(: The set of students to process :)
declare variable $context := /Students;

(: Counts the students whose friend attribute equals the given student's name :)
declare function local:CalculateFriends($student)
{
  let $studentName := $student/@Name
  return fn:count($context[@friend = $studentName])
};

for $s in $context
let $numberOfFriends := local:CalculateFriends($s)
return <Student Name = '{$s/@Name}' NumberOfFriends = '{$numberOfFriends}' />
This code works fine for a single student. For 1000 students, it takes approximately 5 minutes. For 50 000 students, it either crashes or times out, and I cannot debug it. I left it running overnight, and when I came back it still had not produced a result.
Is there a way to optimize this? As far as I can tell, the comparison @friend = $studentName already makes use of the attribute index (it is enabled). Having taken a parallel computing course at university, my first thought was to split the count and the FLWOR expression into chunks and run them in parallel, similar to OpenMP. But after some research, BaseX does not seem to support parallelized queries.
Does anyone have any idea how to approach this problem?
Thanks!
EDIT: Example of XML structure
<Student Name="Kevin" friend="Alvin" BirthDate="1985-06-29" etc..>
<More meta data> ....... />
</Student>
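To make the cost concrete: the query above calls the function once per student, and each call filters the whole set again, so the work grows roughly quadratically unless the index is actually applied. One alternative I have considered but not verified is to count all friend values once and then look each student up. This is only a minimal sketch under my own assumptions: that every document root is a Student element as in the example above, and that the BaseX version in use supports XQuery 3.1 maps. The names $students and $counts are made up for illustration.

```xquery
(: Assumption: each file has a <Student> root element, as in the example XML. :)
declare variable $students := /Student;

(: One pass over all students: map each friend name to the number of
   students whose friend attribute has that value. :)
declare variable $counts := map:merge(
  for $s in $students
  group by $friend := string($s/@friend)
  return map:entry($friend, count($s))
);

for $s in $students
(: Students that nobody names as a friend are absent from the map, so default to 0. :)
let $numberOfFriends := ($counts(string($s/@Name)), 0)[1]
return <Student Name="{$s/@Name}" NumberOfFriends="{$numberOfFriends}"/>
```

If this is valid, each student would be visited twice in total (once for grouping, once for the lookup) instead of once per other student, but I do not know whether it behaves well on a database of this size.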