Apple Silicon · Performance Engineering

972 ms0.30 ms

0×faster than optimized NumPy

The same Mandelbrot workload. The same machine. What changes is how much of the M4 Max you actually use — SIMD lanes, all 14 CPU cores, algorithmic symmetry, and finally the 40-core GPU. Every step measured, verified, and documented below.

0.30 msMetal GPU — 1400×800 × 256 iterations
0.39 msC++ NEON — 14 CPU cores
38 Gpx·it/sGPU throughput at 72 megapixels
99.8 %pixels within ±1 iteration of float64 reference
The platform

One chip, four engines

Peak performance on Apple Silicon isn't one trick — it's refusing to leave any of the M4 Max's execution resources idle.

NEON · SIMD

128-bit vectors, FMA

Four float32 lanes per instruction on every core, with fused multiply-add and branch-free lane masking.

14 CORES

10 P + 4 E, libdispatch

dispatch_apply fans work across performance and efficiency cores with work-stealing — no thread-pool code.

40-CORE GPU

Metal compute

Thousands of threads for embarrassingly parallel math, shaders compiled at runtime from source.

UNIFIED MEMORY

Zero-copy CPU⇄GPU

One physical memory. The GPU writes, the CPU reads the very same bytes — no PCIe transfer, no staging buffers.

The baseline

Five orders of magnitude on one chart

The repository benchmarks SQL engines (recursive CTEs!), Python and NumPy against native code. Same fractal, same escape semantics — from 45 seconds to a third of a millisecond.

Benchmark ladder: SQLite ~45 s down to Metal GPU 0.30 ms, log scale
Single-run wall-clock, 1400×800 px × 256 iterations, Apple M4 Max. Log scale — each gridline is 10×.
Case study

Ten optimizations, each one measured

Escape-time Mandelbrot: iterate z = z² + c per pixel until |z|² > 4. Simple to state, brutal to make fast — the loop is a serial dependency chain with unpredictable exit.

01

Compile for the exact silicon foundation

A dylib built at first import with clang++ -O3 -mcpu=apple-m4, called from Python via ctypes. The compiler schedules for the M4's real pipeline widths and latencies. Build cost paid once, outside the timed path.

02

NEON SIMD — 4 pixels per instruction the big unlock

The escape loop runs on float32x4_t vectors. Escaped lanes are masked out branch-free; iteration counts accumulate by subtracting the all-ones comparison mask. float32 over float64 doubles the lanes — at this zoom the pixel pitch is ~10⁴ ulp of f32, so precision is ample (verified: 99.8 % of pixels within ±1 iteration of the f64 reference).

float32x4_t zr2 = vmulq_f32(zr, zr), zi2 = vmulq_f32(zi, zi);
active = vandq_u32(active, vcleq_f32(vaddq_f32(zr2, zi2), four));
count  = vsubq_u32(count, active);  // +1 where lane still alive
zi = vmlaq_f32(ci, vmulq_f32(two, zr), zi);  // fused 2·zr·zi + ci
03

All 14 cores via libdispatch ~11× of ideal 14×

dispatch_apply spreads rows across P- and E-cores with built-in work-stealing. Two profiler-enforced findings: QOS_USER_INTERACTIVE was 4× worse than USER_INITIATED for this workload, and the pool is warmed by a tiny dispatch at import so the first timed call doesn't pay thread-creation latency.

04

Cardioid & bulb early-out skips the costliest 22 %

Points inside the main cardioid and period-2 bulb never escape — and they're the most expensive pixels, burning all 256 iterations. Both regions have closed-form membership tests: a handful of FLOPs replaces 256 iterations, evaluated vectorized before the loop.

// q·(q + (cr − ¼)) ≤ ¼·ci²  →  inside main cardioid
float32x4_t q = vmlaq_f32(ci2, crm, crm);
uint32x4_t in_set = vcleq_f32(vmulq_f32(q, vaddq_f32(q, crm)),
                              vmulq_f32(quarter, ci2));
05

Instruction-level parallelism: two chains in flight G=2 sweet spot

z = z² + c is a serial dependency chain — one 4-lane chain can't fill the M4's SIMD pipes. Interleaving two independent vector groups (8 pixels) keeps the FMA units busy while each chain waits on its own latency. Measured: G=1 → 0.71 ms, G=2 → 0.39 ms, G=3 → 0.46 ms (register pressure wins).

06

Conjugation symmetry — the free 2× algorithm beats micro-opt

The viewport is symmetric about the real axis, and conj(z² + c) = conj(z)² + conj(c) — with IEEE negation being exact, escape counts are bit-identical for mirrored rows. Compute the top half, memcpy the rest. The best optimization in the whole project is a line of mathematics.

07

Amortized escape checks 0.79 → 0.39 ms with #6

The horizontal reduce (vmaxvq_u32) + branch asking "any lane alive?" serializes the loop. It now runs every 4th iteration. Per-lane masking stays exact — dead lanes idle through up to 3 wasted iterations, and inf/NaN comparisons can never resurrect them.

08

Metal compute shader, compiled at runtime enter the GPU

The MSL kernel ships as a source string — newLibraryWithSource compiles it at import, no offline toolchain. One thread per top-half pixel; each writes its pixel and the mirrored one. The cardioid early-out returns whole 32-wide SIMD groups early in bulk regions.

kernel void mandel(device ushort *out [[buffer(0)]],
                   constant uint4 &p  [[buffer(1)]],
                   uint2 gid [[thread_position_in_grid]])
{
    // … escape loop …
    out[gid.y * w + gid.x]           = (ushort)it;
    out[(h - 1 - gid.y) * w + gid.x] = (ushort)it; // mirror row
}
09

Unified memory, zero-copy 0.41 → 0.30 ms

The GPU writes into a StorageModeShared buffer; Python receives a numpy view of the same physical memory. No 2.2 MB memcpy, no staging, hazard tracking off — waitUntilCompleted is the only synchronization. This is the Apple Silicon advantage discrete GPUs can't match at small sizes.

10

Threadgroup shape measured, not guessed

Swept 32×4 / 32×8 / 32×16 / 32×32 threadgroups: flat through 16, clear regression at 32 (occupancy). Shipped 32×8 = 256 threads per group, matching Apple's guidance — but only after the sweep confirmed it.

Optimization journey: NumPy 972 ms, v1 NEON 0.79 ms, v2 symmetry 0.39 ms, v3 Metal GPU 0.30 ms
The journey on one chart. Log scale; labels show cumulative speedup vs NumPy.
CPU vs GPU

The crossover — and why overhead is destiny

A Metal dispatch carries a fixed ~0.2 ms command-buffer cost. At sub-millisecond workloads the 14-core CPU is nearly even; scale up and the 40-core GPU runs away.

WorkloadC++ NEON · 14 coresMetal · 40-core GPUGPU advantage
1400×800 (1.1 Mpx)0.52 ms0.34 ms1.5×
2800×1600 (4.5 Mpx)1.38 ms0.61 ms2.2×
5600×3200 (18 Mpx)4.94 ms1.58 ms3.1×
11200×6400 (72 Mpx)24.5 ms1.87 ms13.1×
1400×800 @ 4096 iter3.05 ms0.80 ms3.8×
CPU vs GPU scaling chart, log-log, gap grows with workload size
Log–log scaling. CPU grows linearly with pixels; the GPU barely notices until tens of megapixels.
72 megapixels of 256-iteration Mandelbrot in 1.87 ms is ≈ 38 billion pixel-iterations per second — on a laptop, on battery, in silence.
Negative results

What didn't work

Performance engineering is empirical. These all sounded right and measured wrong.

QOS_USER_INTERACTIVE

"Highest priority = fastest," right? Median regressed 4× vs USER_INITIATED. The scheduler knows things you don't.

G=3, G=4 interleave

More chains in flight should hide more latency — but past G=2, register pressure and spills beat the gains.

1024-thread groups

Maxing threadgroup size cost measurable occupancy. 256 threads/group won the sweep.

Deeper unrolling (v1 era)

Until symmetry halved the work, the kernel was overhead-bound — unroll factors measured as pure noise. Fix the algorithm before polishing the loop.

Trust, but verify

Every step was checked before it was kept

Bit-exactness across restructures. v2's top half is bit-identical to v1; every ILP variant produced identical output before timing was even considered.

Accuracy vs float64. Every escaped pixel compared against the f64 reference: ≥ 99.76 % within ±1 iteration at every step (CPU f32: 99.81 %, GPU fast-math: 99.77 %). The differences live on chaotic boundary filaments.

Identical semantics. Same escape convention as the repo's Python and SQL implementations — count = iterations survived before |z|² > 4.

Honest timing. Best and median of 15–31 runs for steady state; fresh-process single calls for what a benchmark harness actually sees; contenders re-measured on an idle machine.

Reproduce it

Clone, build, measure

Everything is open source — the kernels, the harness, the charts, this page.

git clone https://github.com/jirakj/sql-mandelbrot-benchmark
# native kernels build themselves on first import
uv run python main.py        # full suite — SQL engines to Metal
uv run python cppbrot.py     # C++ NEON kernel (0.39 ms)
uv run python metalbrot.py   # Metal GPU kernel (0.30 ms)

Full write-up with every measurement: OPTIMIZATIONS.md