As Bill says, there are two main ways to think about parallelism in Julia: the threaded model, introduced in Julia 1.3, which does shared-memory parallelism via the `Threads.@threads` macro, and distributed processing via the `Distributed.@distributed` macro, which parallelizes across separate Julia processes.
Threads is definitely the closer of the two to an "automagic" parallelism speed-up with minimal or no rewriting of code, and is often a great option - although you have to take care that whatever operation you are running is thread-safe, so always check that the results come out the same.
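To make the thread-safety point concrete, here is a small sketch (independent of the ODE example): the first loop is safe because each iteration writes to its own index, while naively accumulating into a shared variable would not be, so an atomic is used instead.

```julia
using Base.Threads

# Thread-safe: each iteration writes to a distinct index of `results`.
results = zeros(10)
@threads for i in 1:10
    results[i] = i^2
end

# Summing with `total += results[i]` across threads would be a data race;
# an Atomic (or a per-thread accumulator) avoids it.
total = Atomic{Float64}(0.0)
@threads for i in 1:10
    atomic_add!(total, results[i])
end
```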
As your question was originally about `@distributed` parallelism, let me answer that as well. With `@distributed` parallelism, the simplest mental model (I believe) for what's going on is to imagine you're running your code in entirely separate Julia REPLs. Here's a version of your code adapted to the `@distributed` model:

```julia
using Distributed
addprocs(2)

using SharedArrays
using DelimitedFiles

@everywhere begin
    using DifferentialEquations

    tf(x, w) = x * sin(w * x)

    function sys!(dv, v, w, t)
        dv[1] = w * v[1]
        dv[2] = tf(v[1], w)
    end

    times = LinRange(0.1, 2, 25)
    params = LinRange(0.1, 1.2, 100)
end

sols = SharedArray{Float64,2}((length(times), length(params)))

@sync @distributed for i = 1:length(params)
    println(i)
    init_val = [1.0, 1.0]
    tspan = (0.0, 2.0)
    prob = ODEProblem(sys!, init_val, tspan, params[i])
    sol = solve(prob)
    sols[:, i] .= sol(times)[2, :]
end

sols
```
What has changed?
I added `addprocs(2)` at the start of the script. This isn't necessary if you start Julia with `-p 2` (or whatever number of processes you want), but I often find it easier to reason about code when it explicitly sets up the parallel environment directly. Note that this isn't currently possible for threads, i.e. you need to set your `JULIA_NUM_THREADS` environment variable before starting Julia and can't change the number of threads once you're up and running.
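As a quick sketch, you can inspect both setups at runtime; only the process count can be changed after startup:

```julia
using Distributed

println(nprocs())    # total Julia processes (master + workers)
addprocs(2)          # add two worker processes at runtime
println(nworkers())  # workers available to @distributed, now 2
println(workers())   # their process ids, e.g. [2, 3]

println(Threads.nthreads())  # fixed at startup via JULIA_NUM_THREADS
```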
I then moved parts of the code into an `@everywhere begin ... end` block. This runs the enclosed operations on all processes at the same time. Going back to the mental model of running separate Julia instances, you have to look at what's in your `@distributed` loop and make sure that all functions and variables it uses are actually defined on all processes. So e.g. to ensure that every process knows what `ODEProblem` is, you need to run `using DifferentialEquations` on all of them.
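A minimal sketch of that mental model, using hypothetical functions `f` and `g`:

```julia
using Distributed
addprocs(2)

f(x) = 2x  # defined only on the master process
# fetch(@spawnat 2 f(1)) would throw: UndefVarError (`f` not defined on worker 2)

@everywhere g(x) = 2x             # defined on every process
println(fetch(@spawnat 2 g(21)))  # runs on worker 2, prints 42
```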
Finally, I added `@sync` to the distributed loop. This is referenced in the docs for `@distributed`: running the `@distributed` macro with a `for` loop spawns an asynchronous green thread (a `Task`) to handle the distributed execution and immediately moves on to the next line. Since you want to wait until the execution has actually finished, the synchronization `@sync` is required. The issue with your original code is that without waiting for the green thread to complete (synchronizing), it swallows errors and just returns immediately, which is why your `sol` array is empty. You can see this if you run your original code and only add `@sync` - you will then get a `TaskFailedException: on worker 2 - UndefVarError: #sys! not defined`, which tells you that your worker processes don't know about the functions you defined on the master process. In practice you will almost always want `@sync`ed execution, unless you plan to run many such distributed loops in parallel. You also do not need `@sync` when you use an aggregator function in the distributed loop (the `@distributed (func) for i in 1:1000` form of the loop).
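As a sketch of that aggregator form: `@distributed (+)` combines the last expression of each iteration with `+` and returns the reduced value directly, so there is no dangling `Task` to wait on.

```julia
using Distributed

# Each iteration's last expression (i^2) is reduced with (+);
# the call blocks until the result is ready, so no @sync is needed.
total = @distributed (+) for i in 1:1000
    i^2
end
println(total)  # 1000*1001*2001/6 = 333833500
```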
Now what's the best solution here? The answer is: I don't know. `@threads` is a great option for quickly parallelizing a thread-safe operation without rewriting the code, and it is still being actively developed and improved, so it is likely to get even better in the future. There is also `pmap` in the Distributed standard library, which gives you additional options, but this answer is long enough as it is! In my personal experience, nothing replaces (1) thinking about your problem and (2) benchmarking execution. Things to think about are the runtime of your problem (both total and per individual operation you want to distribute) and the message-passing/memory-access requirements.
The upside is that while you might have to spend a bit of effort thinking about things, Julia has a bunch of great options to make the most out of every hardware situation, from a crappy old laptop with two cores (like the one I'm typing this from) to multi-node high-performance clusters (which has made Julia one of the very few programming languages to achieve petaflop performance - although to be fair that is a little trickier than my or Bill's answer :))