Three.js From Zero · Article S2-09
GPGPU with TSL Compute Nodes
Ten thousand flocking boids. Each one looks at its neighbors, computes separation + alignment + cohesion, steers accordingly, integrates velocity. On the CPU that's millions of operations per frame — you'd be lucky to hit 30fps with a few hundred agents. On the GPU you get 100,000+ at 120fps.
The difference is compute shaders — code that runs on the GPU for general-purpose math, not rendering. Three.js exposes these via TSL (the node-based shader language we met in Article 08). This article explains the concept, shows you the TSL code for a boids simulation, and includes a live CPU fallback so you can see the simulation's behavior while reading the GPU version.
This demo runs on the CPU for portability — you'll see ~800 boids smoothly. The same simulation on a WebGPU GPU compute shader does 100,000 boids at 120fps. The article below shows both paths.
What "GPGPU" means
General-Purpose GPU computing. Using the GPU for any parallel math — not just drawing triangles. Particles, cloth, fluids, hair, pathfinding, physics, image processing, neural networks. Anything where the answer for each element doesn't depend on the previous element in a tight sequence.
The GPU has thousands of tiny cores. A CPU has 8–16 fast cores. If you have 100,000 independent things to compute (one per boid), the GPU finishes in 0.3ms. The CPU needs around 50ms, roughly 165× slower.
The compute shader model
In a compute shader, you write a function that runs once per element. The GPU schedules thousands of threads in parallel, each with a unique ID. Your function reads input data by index, does some math, writes output data by index.
// Pseudocode — one compute invocation per boid
function updateBoid(invocationID) {
  const i = invocationID;
  const pos = positions[i];
  const vel = velocities[i];
  let sep = vec3(0), align = vec3(0), coh = vec3(0);
  let n = 0;
  for (let j = 0; j < N; j++) {
    if (i === j) continue;
    const d = distance(pos, positions[j]);
    if (d < perception) {
      sep += (pos - positions[j]) / d;
      align += velocities[j];
      coh += positions[j];
      n++;
    }
  }
  if (n > 0) {
    align /= n;
    coh = (coh / n) - pos;
  }
  const newVel = vel + sep * sepW + align * alignW + coh * cohW;
  velocities[i] = newVel;
  positions[i] = pos + newVel * dt;
}
The entire scene of 100,000 boids runs this function 100,000 times in parallel per frame. CPU: 100,000 × 100,000 = 10 billion distance checks. Death. GPU: same math, divided across 4,096 cores, finishes in milliseconds.
TSL Compute — the actual code
Here's the real TSL version. Note the lack of GLSL string — it's all JavaScript function calls that compile to WGSL (WebGPU) or GLSL (WebGL-fallback):
import { compute, storageBuffer, Fn, If, Loop, float, vec3, instanceIndex, uniform, length } from 'three/tsl';
import { WebGPURenderer } from 'three/webgpu';
const COUNT = 100_000;
const positions = storageBuffer(new Float32Array(COUNT * 3), 'vec3', COUNT).toReadWrite();
const velocities = storageBuffer(new Float32Array(COUNT * 3), 'vec3', COUNT).toReadWrite();
const perception = uniform(1.8);
const sepWeight = uniform(1.5);
const alignWeight = uniform(1.0);
const cohWeight = uniform(1.0);
const dt = uniform(0.016);
// The compute kernel — runs once per invocation
const updateBoids = Fn(() => {
  const i = instanceIndex;
  const pos = positions.element(i);
  const vel = velocities.element(i);
  const sep = vec3(0).toVar();
  const align = vec3(0).toVar();
  const coh = vec3(0).toVar();
  const n = float(0).toVar();
  // Neighborhood loop — each invocation reads all positions (O(n²))
  // Real production code uses a spatial hash to make this O(n · k)
  Loop(COUNT, ({ i: j }) => {
    If(i.notEqual(j), () => {
      const other = positions.element(j);
      const d = length(pos.sub(other));
      If(d.lessThan(perception), () => {
        sep.addAssign(pos.sub(other).div(d));
        align.addAssign(velocities.element(j));
        coh.addAssign(other);
        n.addAssign(1);
      });
    });
  });
  // Apply weights, update velocity + position
  const steering = sep.mul(sepWeight)
    .add(align.div(n.max(1)).mul(alignWeight))
    .add(coh.div(n.max(1)).sub(pos).mul(cohWeight));
  velocities.element(i).assign(vel.add(steering.mul(dt)).clamp(-5, 5));
  positions.element(i).assign(pos.add(velocities.element(i).mul(dt)));
});
// Dispatch the kernel — run on GPU
const computePass = updateBoids.compute(COUNT);
await renderer.computeAsync(computePass);
The Loop, If, and Fn primitives look like JavaScript, but they are really building a compute shader graph. Three.js compiles it to WGSL for WebGPU or GLSL for WebGL (with limitations — not all operations are WebGL-compatible).
Storage buffers — the GPU's persistent memory
A storage buffer lives in GPU memory across frames. You initialize it on the CPU once, then the GPU reads and writes it. No CPU↔GPU round-trip per frame.
const positions = storageBuffer(
  new Float32Array(COUNT * 3), // initial data
  'vec3',                      // element type
  COUNT,                       // number of elements
).toReadWrite();               // compute can both read + write
Use the same storage buffer as attribute-style data for your rendering pass too — the renderer reads positions.element(instanceIndex) directly as the vertex input, and the compute pass writes it. Zero data copies between compute and render.
Workgroups and invocation IDs
GPU compute executes in workgroups — small batches of threads (often 64 or 256) that can share memory and sync within the group. If you have 100,000 elements and workgroup size 64, the GPU schedules 1,563 workgroups of 64 each.
- instanceIndex — which invocation this thread is, 0 to N-1. Use it as your per-element index.
- workgroupID / localInvocationIndex — for shared-memory optimizations.
- Default workgroup size in TSL is 64. Override it when you need different tile sizes.
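To make the scheduling concrete, here's the workgroup arithmetic as plain JavaScript — a sketch of what the driver does when dispatching, not a Three.js API:

```javascript
// Number of workgroups the GPU dispatches for N invocations (ceiling division)
const COUNT = 100_000;
const WORKGROUP_SIZE = 64; // TSL's default
const workgroupCount = Math.ceil(COUNT / WORKGROUP_SIZE);
console.log(workgroupCount); // 1563 (the 1,563 groups mentioned above)
```

Note the last workgroup is only half full (100,000 is not a multiple of 64), which is why kernels often guard against out-of-range indices.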
Rendering the computed data
The render pass reads the same storage buffer as an instance attribute. No copy, no sync:
import { positionLocal, instanceIndex } from 'three/tsl';
const boidGeom = new THREE.ConeGeometry(0.05, 0.2, 8);
const boidMat = new THREE.MeshBasicNodeMaterial(); // any node material works
const boidMesh = new THREE.InstancedMesh(boidGeom, boidMat, COUNT);
// Tell the material to read instance positions straight from the storage buffer
boidMat.positionNode = positionLocal.add(positions.element(instanceIndex));
Perf: the O(n²) problem + spatial hashing
The naive loop checks every boid against every other boid — O(n²). At 100k boids that's 10 billion checks per frame. Too slow even on GPU.
The fix: a spatial hash grid. Divide the world into cells (~perception radius sized). Each boid only checks its cell + the 26 neighbor cells. Brings the work down to O(n · k) where k = average neighbors.
Building the spatial grid is itself a compute pass (count boids per cell → prefix sum → scatter). Three separate compute dispatches per frame, but together ~10× faster than naïve at 100k agents.
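As a rough illustration of the hashing step, here's a hypothetical CPU sketch of mapping a position to a flat cell index — the names cellSize and gridDim are illustrative, not Three.js API:

```javascript
// Map a world position to a flat cell index in a gridDim³ hash grid.
// cellSize should be roughly the perception radius so all neighbors
// fall inside the boid's own cell plus the 26 adjacent cells.
function cellIndex(x, y, z, cellSize, gridDim) {
  const clampCell = (v) => Math.min(gridDim - 1, Math.max(0, Math.floor(v / cellSize)));
  const cx = clampCell(x), cy = clampCell(y), cz = clampCell(z);
  return cx + cy * gridDim + cz * gridDim * gridDim;
}
```

On the GPU, the same arithmetic runs in a compute pass, followed by the prefix-sum and scatter passes described above.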
CPU fallback — what this demo does
For this single-file demo we use a plain CPU loop with 800 boids. The TSL compute version needs WebGPU, which the single-HTML preview path doesn't cleanly support. The article shows the TSL code you'd drop into a Next.js / Vite project that imports three/webgpu.
// CPU version the demo actually runs
function updateCPU(dt) {
  for (let i = 0; i < N; i++) {
    let sx = 0, sy = 0, sz = 0; // separation
    let ax = 0, ay = 0, az = 0; // alignment
    let cx = 0, cy = 0, cz = 0; // cohesion
    let n = 0;
    for (let j = 0; j < N; j++) {
      if (i === j) continue;
      const dx = px[i] - px[j], dy = py[i] - py[j], dz = pz[i] - pz[j];
      const d2 = dx*dx + dy*dy + dz*dz;
      if (d2 < P2) {
        const d = Math.sqrt(d2) || 1;
        sx += dx / d; sy += dy / d; sz += dz / d;
        ax += vx[j]; ay += vy[j]; az += vz[j];
        cx += px[j]; cy += py[j]; cz += pz[j];
        n++;
      }
    }
    // ... apply weights, integrate — same steering math as the GPU kernel
  }
}
Other GPGPU wins
Once you're comfortable with compute dispatches, these open up:
- Particles — 1M+ particles with per-particle life, gravity, collision
- Cloth — mass-spring networks solved on the GPU
- Fluid — SPH or grid-based, real-time water / smoke
- Hair — strand constraints, same math as cloth
- Pathfinding — parallel A* variants, flow fields
- ML inference — ONNX Runtime uses WebGPU under the hood for matmuls
WebGPU vs WebGL
| | WebGL 2 | WebGPU |
|---|---|---|
| Compute shaders | No (faked via fragment shaders) | Yes, first-class |
| Storage buffers | Limited | Yes, full |
| Browser support (2025) | All | Chrome, Edge, Safari 18+, Firefox (partial) |
| TSL compilation target | GLSL | WGSL |
| Performance | Baseline | Often 2–5× faster |
TSL + WebGPURenderer is the forward path. WebGL still covers more devices but loses compute entirely. For serious GPGPU on the web today, pick WebGPU and accept you'll lose a percentage of users on older browsers.
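A minimal feature-detect before picking a renderer might look like this sketch — supportsWebGPU is a hypothetical helper, while navigator.gpu is the standard WebGPU entry point:

```javascript
// Returns true when a navigator-like object exposes the WebGPU API
function supportsWebGPU(nav) {
  return !!nav && 'gpu' in nav;
}

// In the browser you'd branch on supportsWebGPU(navigator) to choose
// between the WebGPU path and a WebGL fallback without compute.
```

Taking the navigator object as a parameter keeps the check testable outside the browser; in page code you'd simply pass the global navigator.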
Common first-time pitfalls
- "Cannot find WebGPURenderer" — import from three/webgpu, not the main entry.
- Compute pass runs but nothing moves — you're not using the storage buffer as the render-time positions. Wire it via positionNode on the material.
- Browser crash / GPU hang — infinite loop in the kernel. Always bound Loop counts.
- Works in Chrome, broken in Safari — Safari's WebGPU support is newer, and some TSL nodes aren't supported. Check `renderer.hasFeature('...')` and fall back gracefully.
- Frame rate low despite GPU compute — rendering 100k instances is still expensive. Use `InstancedMesh` and a simple material, not PBR.
- Results differ slightly between runs — floating-point compute is not bit-deterministic on the GPU across vendors. Accept this or use fixed-point for deterministic sims.
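The nondeterminism in that last pitfall comes from floating-point addition not being associative, so any parallel reduction whose summation order varies between runs can change the result. You can see the effect in plain JavaScript:

```javascript
// Floating-point addition is not associative: grouping changes the result
const a = (0.1 + 0.2) + 0.3;
const b = 0.1 + (0.2 + 0.3);
console.log(a === b); // false (so the order of a parallel sum matters)
```

On the GPU, thousands of threads accumulate in whatever order the scheduler produces, which is why identical inputs can yield slightly different outputs across runs and vendors.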
Exercises
- Add mouse attraction: a uniform for cursor position in world space, add a pull-toward-cursor term to the steering.
- Spatial hash: build a grid-based optimization. Measure the speedup at 100k boids.
- Port to particles: swap the boid rules for a particle emitter (spawn from a ring, apply gravity, fade out by age).
What's next
Article S2-10 — Procedural Worlds. The Season 2 finale. Noise-based heightfield terrain, quadtree LOD, chunk streaming, and instance scattering for grass and rocks. Fly over an infinite world.