Dagger.jl: Fast, Smart Parallelism

Dagger Basics and Essentials

Written by Felipe Tome

This guide covers the core Dagger.jl pieces you will need for basic usage:

  • Dagger.@spawn / Dagger.spawn

  • task options (Options, inline options, with_options)

  • visualization basics

  • DArrays (distributed arrays)

  • datadeps (spawn_datadeps, In, Out, InOut, Deps)

  • GPU usage

  • multi-GPU usage

  • distributed execution

The snippets below were checked against Julia 1.12.4 and Dagger 0.19.3.

TL;DR

  • Use Dagger.@spawn to create a task graph; passing one DTask into another creates dependencies.

  • Use options either inline (Dagger.@spawn scope=... name=... f(x)) or with Dagger.Options.

  • Use Dagger.with_options for block-scoped defaults.

  • For basic graph visualization, enable logging, run tasks, collect logs, then call Dagger.render_logs(..., :graphviz).

  • Use DArrays when you want array-style operations over partitioned data.

  • For mutable shared data, use Dagger.spawn_datadeps with In/Out/InOut.

  • For GPU execution, load a backend (CUDA/AMDGPU/oneAPI/Metal) and either run unpinned (automatic placement) or pin with scope(...).

  • For distributed runs, add workers and scope tasks to specific workers when needed.

  1. Minimal setup

Start Julia from a terminal:

julia

Then in Julia:

using Dagger

If you want more CPU parallelism, launch Julia with threads:

julia -t8
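
If Dagger is not installed yet, add it once from Julia's package manager:

import Pkg; Pkg.add("Dagger")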

  2. @spawn essentials

Think of each Dagger.@spawn as one node in a DAG (Directed Acyclic Graph): a graph where edges are dependencies and there are no cycles.

  • If a task argument is another DTask, Dagger adds an edge automatically.

  • fetch(task) waits and returns the final value.

  • wait(task) waits but does not return the value.

using Dagger

square(x) = x * x
inc(x) = x + 1

a = Dagger.@spawn square(4)   # returns DTask
b = Dagger.@spawn inc(a)      # depends on a

@show fetch(b)  # 17

You can also use the function form:

t = Dagger.spawn(+, 40, 2)
@show fetch(t)  # 42
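
A task can also take more than one DTask argument; Dagger then adds an edge from each upstream task. A minimal sketch:

using Dagger

a = Dagger.@spawn 1 + 2    # 3
b = Dagger.@spawn 10 * 2   # 20
c = Dagger.@spawn a + b    # depends on both a and b

@show fetch(c)  # 23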

  3. How to use options

Dagger options are per-task execution hints/configuration. They let you control how a task is scheduled and labeled, for example:

  • where it can run (scope)

  • how it appears in logs/visualizations (name)

  • other runtime behavior through Dagger.Options fields

You can think of options as metadata attached to a task definition, not the task's actual input data.

In practice, options let you:

  • place tasks on specific resources (CPU worker/thread/GPU) with scope

  • make task graphs easier to read with name

  • add explicit extra task dependencies with syncdeps

  • control advanced behavior like raw-chunk execution (meta) and scheduling/resource hints

You have three common ways to set options.

A) Inline on @spawn

using Dagger

t = Dagger.@spawn name="task-add" +(40, 2)
@show fetch(t)  # 42

Two useful options to start with:

  • scope: where a task can run (worker/thread/device constraints)

  • name: friendly name for logging and visualization

Example:

using Dagger

t = Dagger.@spawn scope=Dagger.scope(worker=1) name="local-sqrt" sqrt(81)
@show fetch(t)  # 9.0

B) Explicit Dagger.Options

using Dagger

opts = Dagger.Options(; name="sum-task", scope=Dagger.scope(worker=1))
t = Dagger.spawn(+, opts, 10, 32)
@show fetch(t)  # 42

C) Block-scoped defaults with Dagger.with_options

Use this when you want many tasks in one block to inherit the same option values.

using Dagger

t = Dagger.with_options(; scope=Dagger.scope(worker=1), name="scoped-task") do
    Dagger.@spawn sqrt(81)
end

@show fetch(t)  # 9.0
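
with_options is most useful when several tasks should share the same defaults. A small sketch:

using Dagger

square(x) = x * x

# Every task spawned inside the block inherits scope = worker 1
tasks = Dagger.with_options(; scope=Dagger.scope(worker=1)) do
    [Dagger.@spawn square(i) for i in 1:4]
end

@show fetch.(tasks)  # [1, 4, 9, 16]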

  4. Visualization basics

Basic flow:

  1. enable logging

  2. run/fetch your tasks

  3. collect logs

  4. render and save (inside Julia)

What you can generate (and why)

  • DAG graph (:graphviz): a dependency graph (Directed Acyclic Graph) of tasks and edges. This is best for understanding task ordering, missing/extra dependencies, and overall pipeline structure.

  • Gantt chart (:plots_gantt): a timeline view where bars show when tasks ran and on which processor. A Gantt chart is best for performance debugging: spotting idle gaps, load imbalance, serialization, or bottlenecks.

For Gantt charts, Dagger can render different targets (:execution, :processor, :scheduler) depending on whether you want task-level timing, processor activity, or scheduler-event timing.

Pure Julia graph visualization + save

using Dagger
using GraphViz   # install once: import Pkg; Pkg.add("GraphViz")

Dagger.enable_logging!(all_task_deps=true)

x = Dagger.@spawn name="base" sum(1:10)
y = Dagger.@spawn name="double" x * 2
fetch(y)

logs = Dagger.fetch_logs!()
gv = Dagger.render_logs(logs, :graphviz)  # GraphViz.Graph

# Save SVG directly from Julia (no external CLI pipeline)
open("dagger_graph.svg", "w") do io
    show(io, MIME"image/svg+xml"(), gv)
end

Dagger.disable_logging!()

Optional richer renderers

  • If GraphViz.jl is installed: Dagger.render_logs(logs, :graphviz)

  • If Plots.jl + DataFrames.jl are installed: Dagger.render_logs(logs, :plots_gantt)

Saving generated visualizations

Save graph output as SVG inside Julia:

using Dagger
using GraphViz

Dagger.enable_logging!(all_task_deps=true)
t = Dagger.@spawn sum(1:10)
fetch(t)
logs = Dagger.fetch_logs!()

gv = Dagger.render_logs(logs, :graphviz)
open("dagger_graph.svg", "w") do io
    show(io, MIME"image/svg+xml"(), gv)
end

Dagger.disable_logging!()

Save Gantt plots rendered through Plots:

using Dagger
using Plots, DataFrames

Dagger.enable_logging!(all_task_deps=true)
t = Dagger.@spawn sum(1:10)
fetch(t)
logs = Dagger.fetch_logs!()

p = Dagger.render_logs(logs, :plots_gantt)
savefig(p, "dagger_gantt.png")
savefig(p, "dagger_gantt.pdf")

Dagger.disable_logging!()

  5. DArrays basics

DArray is Dagger's distributed array type: data is partitioned into chunks, and operations can run chunkwise in parallel.

Create and use a DArray

using Dagger

A = Dagger.distribute(rand(Float32, 64, 64), Dagger.Blocks(32, 32))
B = map(x -> 2f0 * x, A)
s = sum(B)
M = collect(B)

@show typeof(A) size(A)
@show s
@show size(M) eltype(M)

Native constructors with block partitioning

using Dagger

A = rand(Dagger.Blocks(64, 64), Float32, 256, 256)
Z = zeros(Dagger.Blocks(64, 64), Float32, 256, 256)

@show typeof(A) typeof(Z)
@show size(A) size(Z)
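
Beyond map, sum, and collect, DArrays aim to support other standard array operations chunkwise. A short sketch, assuming broadcasting and matrix multiplication are available for DArray in your Dagger version:

using Dagger

A = rand(Dagger.Blocks(64, 64), Float32, 256, 256)
B = A .+ 1f0      # chunkwise broadcasting (assumed supported)
C = A * A         # distributed matrix multiply (assumed supported)

@show sum(B)
@show size(collect(C))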

  6. Datadeps basics

Datadeps ("data dependencies") is Dagger's way to safely schedule tasks that read/write shared mutable data by declaring how each task accesses that data.

Use datadeps when tasks mutate shared data.

The rule of thumb:

  • In(x): read x

  • Out(x): write x

  • InOut(x): read and write x

  • unspecified dependency defaults to read (In)

Run mutable task groups inside Dagger.spawn_datadeps() do ... end.

using Dagger

fill1!(x) = (fill!(x, 1); nothing)
add2!(x) = (x .+= 2; nothing)
sumv(x) = sum(x)

A = zeros(Int, 6)
t_before = Ref{Any}()
t_after = Ref{Any}()

Dagger.spawn_datadeps() do
    Dagger.@spawn fill1!(Out(A))          # writes A
    t_before[] = Dagger.@spawn sumv(In(A))
    Dagger.@spawn add2!(InOut(A))         # mutates A
    t_after[] = Dagger.@spawn sumv(In(A))
end

@show fetch(t_before[])  # 6
@show fetch(t_after[])   # 18
@show A                  # [3, 3, 3, 3, 3, 3]

Important behavior in datadeps regions:

  • Dagger determines ordering from dependency annotations.

  • spawn_datadeps waits for all submitted tasks before returning.

  • Avoid calling fetch on tasks from inside the same datadeps block.

About Deps(...)

Deps(x, ...) is an advanced wrapper used to attach one or more dependency modifiers to the same value (for finer-grained partial-region dependency tracking). Most users can start with In/Out/InOut and only move to Deps when they need custom aliasing behavior.

Datadeps with DArray chunks

using Dagger

add2_chunk!(x) = (x .+= 2f0; nothing)

A = zeros(Dagger.Blocks(32, 32), Float32, 64, 64)

Dagger.spawn_datadeps() do
    for Ac in Dagger.chunks(A)
        Dagger.@spawn fill!(Out(Ac), 1f0)
        Dagger.@spawn add2_chunk!(InOut(Ac))
    end
end

M = collect(A)
@show sum(M)        # 12288.0
@show unique(M)     # Float32[3.0]

  7. GPU usage (single GPU)

GPU execution is opt-in through backend packages. Load one of:

  • using CUDA (NVIDIA)

  • using AMDGPU (AMD)

  • using oneAPI (Intel)

  • using Metal (Apple)

You can use GPUs in two modes:

  1. Unpinned: let Dagger place tasks automatically (usually based on GPU-array compatibility).

  2. Pinned: force task placement to specific device(s) using scope(...).

Unpinned CUDA example:

using Dagger, CUDA

CUDA.functional() || error("CUDA is not functional")

A = CUDA.rand(Float32, 2048, 2048)
t = Dagger.@spawn sum(abs2, A)   # no explicit scope pinning

@show fetch(t)

Pinned mode uses backend-specific scope keys:

  • CUDA: cuda_gpu / cuda_gpus

  • AMDGPU: rocm_gpu / rocm_gpus

  • oneAPI: intel_gpu / intel_gpus

  • Metal: metal_gpu / metal_gpus

If the backend package is not loaded, those scope keys are not available.

Pinned CUDA example:

using Dagger, CUDA

CUDA.functional() || error("CUDA is not functional")

A = CUDA.rand(Float32, 2048, 2048)
t = Dagger.@spawn scope=Dagger.scope(cuda_gpu=1, worker=1) sum(abs2, A)

@show fetch(t)

DArray + GPU example (CUDA)

using Dagger, CUDA

CUDA.functional() || error("CUDA is not functional")

gpu_chunk_sum(x) = sum(abs2, CUDA.CuArray(x))

A = fetch(rand(Dagger.Blocks(512, 512), Float32, 2048, 2048))
chunk_tasks = [Dagger.@spawn gpu_chunk_sum(c) for c in Dagger.chunks(A)]
total = sum(fetch.(chunk_tasks))

@show total

  8. Multi-GPU usage

For multi-GPU workloads you can also choose unpinned or pinned placement.

Unpinned pattern (allocate inputs on each GPU, then spawn normally):

using Dagger, CUDA

CUDA.functional() || error("CUDA is not functional")
length(CUDA.devices()) > 0 || error("No CUDA devices found")

arrays = [CUDA.device!(dev) do
    CUDA.rand(Float32, 2048, 2048)
end for dev in CUDA.devices()]

tasks = [Dagger.@spawn sum(abs2, A) for A in arrays]   # no scope pinning
results = fetch.(tasks)
@show length(results)

Pinned pattern (explicit one-task-per-device):

using Dagger, CUDA

gpu_sum(n) = sum(abs2, CUDA.rand(Float32, n, n))

ngpu = length(CUDA.devices())
ngpu > 0 || error("No CUDA devices found")

tasks = [
    Dagger.@spawn scope=Dagger.scope(cuda_gpu=g, worker=1) name="gpu-$g" gpu_sum(2048)
    for g in 1:ngpu
]

results = fetch.(tasks)
@show ngpu results

Multi-GPU with DArray chunks (pinned CUDA example)

using Dagger, CUDA

CUDA.functional() || error("CUDA is not functional")
ngpu = length(CUDA.devices())
ngpu > 0 || error("No CUDA devices found")

gpu_chunk_sum(x) = sum(abs2, CUDA.CuArray(x))

A = fetch(rand(Dagger.Blocks(1024, 1024), Float32, 4096, 4096))
tasks = [
    Dagger.@spawn scope=Dagger.scope(cuda_gpu=mod1(i, ngpu), worker=1) gpu_chunk_sum(c)
    for (i, c) in enumerate(Dagger.chunks(A))
]

total = sum(fetch.(tasks))
@show total

Notes:

  • This pattern scales when each GPU gets enough work (coarse tasks are better than many tiny tasks).

  • For mixed CPU/GPU pipelines, keep dependent tasks on the same GPU when possible to reduce transfers.

  • Use pinning when you need deterministic device placement or strict resource partitioning.

  9. Distributed execution (multi-process / multi-node)

Dagger can schedule across Julia workers (Distributed processes).

Basic workflow:

  1. start workers (julia -p N or addprocs)

  2. load packages on all workers

  3. spawn tasks normally (or pin with scope(worker=...))

using Distributed
pids = addprocs(2)

@everywhere using Dagger
using Dagger

t_any = Dagger.@spawn sum(1:1_000_000)
t_w1 = Dagger.@spawn scope=Dagger.scope(worker=pids[1]) myid()
t_w2 = Dagger.@spawn scope=Dagger.scope(worker=pids[2]) myid()

@show fetch(t_any)
@show fetch(t_w1) fetch(t_w2)
rmprocs(pids)
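
Functions that tasks call on remote workers must also be defined on those workers (see the pitfalls section below). A minimal sketch, where heavy is just an illustrative name:

using Distributed
pids = addprocs(2)

@everywhere using Dagger
@everywhere heavy(n) = sum(abs2, 1:n)   # define the function on every process

t = Dagger.@spawn scope=Dagger.scope(worker=pids[2]) heavy(100_000)
@show fetch(t)
rmprocs(pids)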

Distributed + GPU pinning example (CUDA):

using Distributed
pids = addprocs(2)

@everywhere using Dagger, CUDA
using Dagger

t = Dagger.@spawn scope=Dagger.scope(worker=pids[1], cuda_gpu=1) CUDA.device()
@show fetch(t)
rmprocs(pids)

  10. Common pitfalls

  • @spawn expects a single function call expression, not an arbitrary begin ... end block.

  • If tasks mutate shared data and you do not use datadeps annotations, results may be wrong.

  • Over-constraining scope can leave no eligible processors and cause scheduling errors.

  • Logging/visualization has overhead; disable it in normal benchmark runs.

  • Backend-specific GPU scope keys only exist after loading the backend package.

  • Unpinned GPU scheduling is data-driven; use pinning when exact GPU assignment matters.

  • Dagger.chunks(...) is not exported; call it with the Dagger. prefix.

  • In distributed mode, remember @everywhere using ... for packages/functions needed on workers.

  11. Practical starter template

using Dagger, GraphViz

function run_pipeline(x)
    Dagger.enable_logging!(all_task_deps=true)
    try
        t1 = Dagger.@spawn name="square" x^2
        t2 = Dagger.@spawn name="plus1" t1 + 1
        result = fetch(t2)

        logs = Dagger.fetch_logs!()
        gv = Dagger.render_logs(logs, :graphviz)
        open("pipeline.svg", "w") do io
            show(io, MIME"image/svg+xml"(), gv)
        end
        return result
    finally
        Dagger.disable_logging!()
    end
end
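
A quick usage check (assuming GraphViz.jl is installed and the function above is defined):

@show run_pipeline(5)  # 26, and writes pipeline.svg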

References

  • Dagger docs: https://juliaparallel.org/Dagger.jl/stable/

  • Dagger.jl repo: https://github.com/JuliaParallel/Dagger.jl
