I want to calculate a statistic over all pairwise combinations of the columns of a very large matrix. I have a python script, called jaccard.py, that accepts a pair of column indices and computes this statistic over the matrix.

On my work machine, each calculation takes about 10 seconds, and I have about 95,000 of these calculations to complete, which works out to roughly 11 days of serial compute time. However, all these calculations are independent of one another, and I am looking to use a cluster we have that runs the Torque queueing system and python2.4. What's the best way to parallelize this calculation so it's compatible with Torque?

I have made the calculations themselves compatible with python2.4, but I am at a loss as to how to parallelize these calculations using subprocess, or whether I can even do that because of the GIL.
The main idea I have is to keep a constant pool of subprocesses running; when one finishes, read its output and start a new one on the next pair of columns. I only need the output once a calculation is finished, and then the process can be restarted on a new pair.

My idea was to submit the job this way:
qsub -l nodes=4:ppn=8 myjob.sh > outfile
myjob.sh would invoke a main python file that looks like the following:
import os, sys
from subprocess import Popen, PIPE
from select import select

def combinations(iterable, r):
    # backport of itertools.combinations (not in the stdlib until python2.6)
    pass

col_pairs = combinations(range(598), 2)

# Keep a constant pool of 8 workers, one per core on a node
processes = [Popen(['./jaccard.py'] + map(str, col_pairs.next()),
                   stdout=PIPE)
             for _ in range(8)]

try:
    while 1:
        for p in processes:
            # If the process has completed the calculation, print it out
            # **How do I do this part?**
            # Delete the process and add a new one
            p.stdout.close()
            processes.remove(p)
            processes.append(Popen(['./jaccard.py'] + map(str, col_pairs.next()),
                                   stdout=PIPE))
# When there are no more column pairs, end the job.
except StopIteration:
    pass
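For the part I've marked above, my best guess so far is to poll each process with Popen.poll(), which I believe is available in python2.4's subprocess; this is untested, and the sleep interval and the spawn helper are just placeholders I made up:

import time
from subprocess import Popen, PIPE

def run_pool(col_pairs, pool_size=8):
    # spawn one worker for a given (i, j) column pair
    def spawn(pair):
        return Popen(['./jaccard.py'] + map(str, pair), stdout=PIPE)

    processes = [spawn(col_pairs.next()) for _ in range(pool_size)]
    while processes:
        time.sleep(0.5)                    # avoid a busy-wait
        # iterate backwards so deleting finished entries is safe
        for i in range(len(processes) - 1, -1, -1):
            p = processes[i]
            if p.poll() is None:           # None means still running
                continue
            print p.stdout.read()          # collect the finished result
            p.stdout.close()
            del processes[i]
            try:
                processes.append(spawn(col_pairs.next()))
            except StopIteration:
                pass                       # no pairs left; let the pool drain

But I don't know whether polling like this is the right approach, or whether select on the stdout pipes would be better.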
Any advice on how best to do this? I have never used Torque and am unfamiliar with managing subprocesses in this way. I tried using multiprocessing.Pool on my workstation, and it worked flawlessly with Pool.map, but since the cluster uses python2.4, I'm not sure how to proceed.
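For reference, the version that worked on my workstation (which has a newer python with multiprocessing) looked roughly like this; compute_pair here is just a stand-in for the real statistic, which lives in jaccard.py:

from itertools import combinations
from multiprocessing import Pool

def compute_pair(pair):
    # placeholder for the real statistic computed by jaccard.py
    i, j = pair
    return (i, j, 0.0)

if __name__ == '__main__':
    pool = Pool(processes=8)                  # one worker per core
    pairs = list(combinations(range(598), 2))
    results = pool.map(compute_pair, pairs)   # blocks until all are done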
EDIT: Actually, on second thought, I could just write multiple qsub scripts, each working on a single chunk of the 95,000 calculations. I could submit something like 16 different jobs, each doing 7125 calculations, which is essentially the same thing.
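In that case, each job's script could invoke a small driver like the following (untested; the round-robin split and the driver.py invocation are just my own sketch of how the chunking might work):

import sys
from subprocess import Popen, PIPE

def main():
    # e.g. each of the 16 qsub scripts would run: python driver.py K 16
    chunk_index = int(sys.argv[1])    # which job this is (0..num_chunks-1)
    num_chunks = int(sys.argv[2])     # total number of jobs submitted
    n_cols = 598                      # matches the range(598) above
    k = 0
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            # deal pairs out round-robin so each job gets an equal share
            if k % num_chunks == chunk_index:
                p = Popen(['./jaccard.py', str(i), str(j)], stdout=PIPE)
                print p.stdout.read() # one result at a time, serially
                p.stdout.close()
                p.wait()
            k = k + 1

if __name__ == '__main__':
    main()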