Three.js From Zero · Article S2-09
GPGPU with TSL Compute Nodes
Ten thousand flocking boids. Each one looks at its neighbors, computes separation + alignment + cohesion, steers accordingly, integrates velocity. On the CPU that's millions of operations per frame — you'd be lucky to hit 30fps with a few hundred agents. On the GPU you get 100,000+ at 120fps.
The difference is compute shaders — code that runs on the GPU for general-purpose math, not rendering. Three.js exposes these via TSL (the node-based shader language we met in Article 08). This article explains the concept, shows you the TSL code for a boids simulation, and includes a live CPU fallback so you can see the simulation's behavior while reading the GPU version.
This demo runs on the CPU for portability — you'll see ~800 boids smoothly. The same simulation on a WebGPU GPU compute shader does 100,000 boids at 120fps. The article below shows both paths.
What "GPGPU" means
General-Purpose GPU computing. Using the GPU for any parallel math — not just drawing triangles. Particles, cloth, fluids, hair, pathfinding, physics, image processing, neural networks. Anything where the answer for each element doesn't depend on the previous element in a tight sequence.
The GPU has thousands of tiny cores. A CPU has 8–16 fast cores. If you have 100,000 independent things to compute (one per boid), the GPU finishes in 0.3ms. The CPU needs around 50ms, roughly 165× slower.
The compute shader model
In a compute shader, you write a function that runs once per element. The GPU schedules thousands of threads in parallel, each with a unique ID. Your function reads input data by index, does some math, writes output data by index.
// Pseudocode — one compute invocation per boid
function updateBoid(invocationID) {
  const i = invocationID;
  const pos = positions[i];
  const vel = velocities[i];
  let sep = vec3(0), align = vec3(0), coh = vec3(0);
  let n = 0;
  for (let j = 0; j < N; j++) {
    if (i === j) continue;
    const d = distance(pos, positions[j]);
    if (d < perception) {
      sep += (pos - positions[j]) / d;
      align += velocities[j];
      coh += positions[j];
      n++;
    }
  }
  if (n > 0) {
    align /= n;
    coh = (coh / n) - pos;
  }
  const newVel = vel + sep * sepW + align * alignW + coh * cohW;
  velocities[i] = newVel;
  positions[i] = pos + newVel * dt;
}
The entire scene of 100,000 boids runs this function 100,000 times in parallel per frame. CPU: 100,000 × 100,000 = 10 billion distance checks. Death. GPU: same math, divided across 4,096 cores, finishes in milliseconds.
TSL Compute — the actual code
Here's the real TSL version. Note the lack of GLSL string — it's all JavaScript function calls that compile to WGSL (WebGPU) or GLSL (WebGL-fallback):
import { compute, storageBuffer, Fn, If, Loop, float, vec3, instanceIndex, uniform, length } from 'three/tsl';
import { WebGPURenderer } from 'three/webgpu';
const COUNT = 100_000;
const positions = storageBuffer(new Float32Array(COUNT * 3), 'vec3', COUNT).toReadWrite();
const velocities = storageBuffer(new Float32Array(COUNT * 3), 'vec3', COUNT).toReadWrite();
const perception = uniform(1.8);
const sepWeight = uniform(1.5);
const alignWeight = uniform(1.0);
const cohWeight = uniform(1.0);
const dt = uniform(0.016);
// The compute kernel — runs once per invocation
const updateBoids = Fn(() => {
  const i = instanceIndex;
  const pos = positions.element(i);
  const vel = velocities.element(i);
  const sep = vec3(0).toVar();
  const align = vec3(0).toVar();
  const coh = vec3(0).toVar();
  const n = float(0).toVar();
  // Neighborhood loop — each invocation reads all positions (O(n²))
  // Real production code uses a spatial hash to make this O(n · k)
  Loop(COUNT, ({ i: j }) => {
    If(i.notEqual(j), () => {
      const other = positions.element(j);
      const d = length(pos.sub(other));
      If(d.lessThan(perception), () => {
        sep.addAssign(pos.sub(other).div(d));
        align.addAssign(velocities.element(j));
        coh.addAssign(other);
        n.addAssign(1);
      });
    });
  });
  // Apply weights, update velocity + position
  const steering = sep.mul(sepWeight)
    .add(align.div(n.max(1)).mul(alignWeight))
    .add(coh.div(n.max(1)).sub(pos).mul(cohWeight));
  velocities.element(i).assign(vel.add(steering.mul(dt)).clamp(-5, 5));
  positions.element(i).assign(pos.add(velocities.element(i).mul(dt)));
});
// Dispatch the kernel — run on GPU
const computePass = updateBoids.compute(COUNT);
await renderer.computeAsync(computePass);
The Loop, If, and Fn primitives look like JavaScript, but they are really building a compute shader graph. Three.js compiles it to WGSL for WebGPU or GLSL for WebGL (with limitations — not all operations are WebGL-compatible).
Storage buffers — the GPU's persistent memory
A storage buffer lives in GPU memory across frames. You initialize it on the CPU once, then the GPU reads and writes it. No CPU↔GPU round-trip per frame.
const positions = storageBuffer(
  new Float32Array(COUNT * 3), // initial data
  'vec3',                      // element type
  COUNT,                       // number of elements
).toReadWrite();               // compute can both read + write
Use the same storage buffer as attribute-style data for your rendering pass too — the renderer reads positions.element(instanceIndex) directly as the vertex input, and the compute pass writes it. Zero data copies between compute and render.
Workgroups and invocation IDs
GPU compute executes in workgroups — small batches of threads (often 64 or 256) that can share memory and sync within the group. If you have 100,000 elements and workgroup size 64, the GPU schedules 1,563 workgroups of 64 each.
- instanceIndex — which invocation this thread is, 0 to N-1. Use it as your per-element index.
- workgroupID / localInvocationIndex — for shared-memory optimizations.
- Default workgroup size in TSL is 64. Override it when you need different tile sizes.
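To make the scheduling concrete, here's the workgroup arithmetic as plain JavaScript — a sketch of what the driver does when dispatching, not a Three.js API:

```javascript
// Number of workgroups the GPU dispatches for N invocations (ceiling division)
const COUNT = 100_000;
const WORKGROUP_SIZE = 64; // TSL's default
const workgroupCount = Math.ceil(COUNT / WORKGROUP_SIZE);
console.log(workgroupCount); // 1563 (the 1,563 groups mentioned above)
```

Note the last workgroup is only half full (100,000 is not a multiple of 64), which is why kernels often guard against out-of-range indices.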
Rendering the computed data
The render pass reads the same storage buffer as an instance attribute. No copy, no sync:
import { positionLocal, instanceIndex } from 'three/tsl';
const boidGeom = new THREE.ConeGeometry(0.05, 0.2, 8);
const boidMat = new THREE.MeshBasicNodeMaterial(); // any node material works
const boidMesh = new THREE.InstancedMesh(boidGeom, boidMat, COUNT);
// Tell the material to read instance positions straight from the storage buffer
boidMat.positionNode = positionLocal.add(positions.element(instanceIndex));
Perf: the O(n²) problem + spatial hashing
The naive loop checks every boid against every other boid — O(n²). At 100k boids that's 10 billion checks per frame. Too slow even on GPU.
The fix: a spatial hash grid. Divide the world into cells (~perception radius sized). Each boid only checks its cell + the 26 neighbor cells. Brings the work down to O(n · k) where k = average neighbors.
Building the spatial grid is itself a compute pass (count boids per cell → prefix sum → scatter). Three separate compute dispatches per frame, but together ~10× faster than naïve at 100k agents.
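As a rough illustration of the hashing step, here's a hypothetical CPU sketch of mapping a position to a flat cell index — the names cellSize and gridDim are illustrative, not Three.js API:

```javascript
// Map a world position to a flat cell index in a gridDim³ hash grid.
// cellSize should be roughly the perception radius so all neighbors
// fall inside the boid's own cell plus the 26 adjacent cells.
function cellIndex(x, y, z, cellSize, gridDim) {
  const clampCell = (v) => Math.min(gridDim - 1, Math.max(0, Math.floor(v / cellSize)));
  const cx = clampCell(x), cy = clampCell(y), cz = clampCell(z);
  return cx + cy * gridDim + cz * gridDim * gridDim;
}
```

On the GPU, the same arithmetic runs in a compute pass, followed by the prefix-sum and scatter passes described above.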
CPU fallback — what this demo does
For this single-file demo we use a plain CPU loop with 800 boids. The TSL compute version needs WebGPU, which the single-HTML preview path doesn't cleanly support. The article shows the TSL code you'd drop into a Next.js / Vite project that imports three/webgpu.
// CPU version the demo actually runs
function updateCPU(dt) {
  for (let i = 0; i < N; i++) {
    let sx = 0, sy = 0, sz = 0; // separation
    let ax = 0, ay = 0, az = 0; // alignment
    let cx = 0, cy = 0, cz = 0; // cohesion
    let n = 0;
    for (let j = 0; j < N; j++) {
      if (i === j) continue;
      const dx = px[i] - px[j], dy = py[i] - py[j], dz = pz[i] - pz[j];
      const d2 = dx*dx + dy*dy + dz*dz;
      if (d2 < P2) {
        const d = Math.sqrt(d2) || 1;
        sx += dx / d; sy += dy / d; sz += dz / d;
        ax += vx[j]; ay += vy[j]; az += vz[j];
        cx += px[j]; cy += py[j]; cz += pz[j];
        n++;
      }
    }
    // ... apply weights, integrate — same steering math as the GPU kernel
  }
}
Other GPGPU wins
Once you're comfortable with compute dispatches, these open up:
- Particles — 1M+ particles with per-particle life, gravity, collision
- Cloth — mass-spring networks solved on the GPU
- Fluid — SPH or grid-based, real-time water / smoke
- Hair — strand constraints, same math as cloth
- Pathfinding — parallel A* variants, flow fields
- ML inference — ONNX Runtime uses WebGPU under the hood for matmuls
WebGPU vs WebGL
| | WebGL 2 | WebGPU |
|---|---|---|
| Compute shaders | No (faked via fragment shaders) | Yes, first-class |
| Storage buffers | Limited | Yes, full |
| Browser support (2025) | All | Chrome, Edge, Safari 18+, Firefox (partial) |
| TSL compilation target | GLSL | WGSL |
| Performance | Baseline | Often 2–5× faster |
TSL + WebGPURenderer is the forward path. WebGL still covers more devices but loses compute entirely. For serious GPGPU on the web today, pick WebGPU and accept you'll lose a percentage of users on older browsers.
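A minimal feature-detect before picking a renderer might look like this sketch — supportsWebGPU is a hypothetical helper, while navigator.gpu is the standard WebGPU entry point:

```javascript
// Returns true when a navigator-like object exposes the WebGPU API
function supportsWebGPU(nav) {
  return !!nav && 'gpu' in nav;
}

// In the browser you'd branch on supportsWebGPU(navigator) to choose
// between the WebGPU path and a WebGL fallback without compute.
```

Taking the navigator object as a parameter keeps the check testable outside the browser; in page code you'd simply pass the global navigator.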
Common first-time pitfalls
- "Cannot find WebGPURenderer" — import from three/webgpu, not the main entry.
- Compute pass runs but nothing moves — you're not using the storage buffer as the render-time positions. Wire it via positionNode on the material.
- Browser crash / GPU hang — infinite loop in the kernel. Always bound Loop counts.
- Works in Chrome, broken in Safari — Safari's WebGPU support is newer, and some TSL nodes aren't supported. Check `renderer.hasFeature('...')` and fall back gracefully.
- Frame rate low despite GPU compute — rendering 100k instances is still expensive. Use `InstancedMesh` and a simple material, not PBR.
- Results differ slightly between runs — floating-point compute is not bit-deterministic on the GPU across vendors. Accept this or use fixed-point for deterministic sims.
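The nondeterminism in that last pitfall comes from floating-point addition not being associative, so any parallel reduction whose summation order varies between runs can change the result. You can see the effect in plain JavaScript:

```javascript
// Floating-point addition is not associative: grouping changes the result
const a = (0.1 + 0.2) + 0.3;
const b = 0.1 + (0.2 + 0.3);
console.log(a === b); // false (so the order of a parallel sum matters)
```

On the GPU, thousands of threads accumulate in whatever order the scheduler produces, which is why identical inputs can yield slightly different outputs across runs and vendors.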
Exercises
- Add mouse attraction: a uniform for cursor position in world space, add a pull-toward-cursor term to the steering.
- Spatial hash: build a grid-based optimization. Measure the speedup at 100k boids.
- Port to particles: swap the boid rules for a particle emitter (spawn from a ring, apply gravity, fade out by age).
What's next
Article S2-10 — Procedural Worlds. The Season 2 finale. Noise-based heightfield terrain, quadtree LOD, chunk streaming, and instance scattering for grass and rocks. Fly over an infinite world.