Three.js From Zero · Article 14
Hand-Tracked Floating Windows — Vision OS in Your Browser
A wall of macOS-style app windows (Safari, Xcode, Music, TV, Pages, Keynote, Numbers, Photos) floating in 3D. Webcam-driven hand tracking via MediaPipe — hover to highlight, pinch+drag to move, two-hand pinch-spread to resize, air-tap to open/close. Plus toolbar toggles, expandable camera-with-skeleton mini-cam, and scroll-wheel fallback.
A wall of macOS-style app windows — Safari, Xcode, Music, TV, Pages, Keynote, Numbers, Photos — floating in 3D. Your webcam sees your hands. Hover to highlight. Pinch + drag to move and reorder the stack. Pinch + spread with both hands to resize. Quick pinch on any window — air-tap — and it opens to focus mode; the others dim and push back. Air-tap empty space (or hit ESC) to close. Mouse + scroll-wheel work too. Zero installs. Zero servers. One page.
↑ This is the entire project running on this page. No build step. View source to see all 900 lines.
STEP 01 The shape of the thing
We're shipping two layers that have to dance together. Each one is simple on its own. It's the interface between them that earns the wow.
- A 3D scene with eight Mac-style productivity windows in a soft 4×2 grid. Each is a plane carrying a CanvasTexture with animated content: Safari (Apple Newsroom), Xcode (SwiftUI typing itself line by line), Music (Apple Music with a live waveform), TV (Foundation banner + Trending Shows grid), Pages (a document typing itself), Keynote (slide thumbnails + a pulsing live slide), Numbers (Q4 forecast spreadsheet with a growth bar chart), and Photos (12-tile library).
- A hand tracker driven by MediaPipe's Hand Landmarker (a ~10 MB ML model that loads from a CDN and runs entirely on-device via WebGL). It hands us 21 landmark points per frame at 30+ fps.
The bridge between them is a single THREE.Vector3 — the hand cursor.
Its position comes from landmark 8 (index fingertip). Its pinch state comes
from the distance between landmarks 4 (thumb tip) and 8.
Smooth it with a One-Euro filter, derive a depth from hand size, drive everything off it.
The rest is polish.
The 21-landmark model
| Index | Landmark | Why it matters here |
|---|---|---|
| 0 | Wrist | Stable anchor for palm orientation if you need it |
| 4 | Thumb tip | Half of the pinch pair |
| 5 | Index MCP (knuckle) | Hand size proxy → depth estimate |
| 8 | Index fingertip | The cursor. Other half of pinch. |
| 12, 16, 20 | Middle / Ring / Pinky tips | Future gestures (peace sign, fist, etc.) |
STEP 02 Scene, camera, renderer, lights
Nothing exotic. A perspective camera at (0, 0, 3.5). A canvas inside the
demo container. antialias: true. SRGBColorSpace. The
background is a soft purple gradient — easy with scene.background set to
null and a CSS gradient on the wrapper, which is what we do on this page.
import * as THREE from 'three';
const scene = new THREE.Scene();
const camera = new THREE.PerspectiveCamera(55, 16/10, 0.1, 50);
camera.position.set(0, 0, 3.5);
const renderer = new THREE.WebGLRenderer({
canvas: document.getElementById('scene'),
antialias: true,
alpha: true,
});
renderer.setPixelRatio(Math.min(devicePixelRatio, 2));
renderer.outputColorSpace = THREE.SRGBColorSpace;
// Soft ambient + a single key light — windows mostly carry their own colour
// via the CanvasTexture, so lighting is just for the subtle highlights.
scene.add(new THREE.AmbientLight(0xffffff, 0.6));
const key = new THREE.DirectionalLight(0xc084fc, 0.5);
key.position.set(2, 3, 5);
scene.add(key);
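We pick FOV 55° deliberately: wider than cinema, narrower than a fisheye.
With the camera at z = 3.5 and a 16:10 aspect, the visible area at z = 0 is roughly 5.8 × 3.6 world units, which fits the 4×2 wall with breathing room.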
STEP 03 The mock window: a plane with a painted canvas
A window is the simplest possible thing: a PlaneGeometry wearing a
MeshBasicMaterial whose map is a CanvasTexture.
The texture is a 2D canvas we draw into per frame for animated content (clock, code,
waveform), once for static content (photos, weather).
function createWindow({ width, height, draw, animated }) {
const canvas = document.createElement('canvas');
canvas.width = 1024;
canvas.height = Math.round(1024 * height / width);
const texture = new THREE.CanvasTexture(canvas);
texture.colorSpace = THREE.SRGBColorSpace;
texture.anisotropy = 4;
const mat = new THREE.MeshBasicMaterial({
map: texture,
transparent: true,
});
const mesh = new THREE.Mesh(
new THREE.PlaneGeometry(width, height),
mat,
);
mesh.userData = { canvas, ctx: canvas.getContext('2d'), texture, draw, animated };
return mesh;
}
The trick: every window has its own draw(ctx, w, h, time, state)
function. Inside that function we paint a rounded background, a title bar, traffic-light
buttons, and the window's unique content. Look at the source on this page —
drawBrowser, drawCode, drawMusic, drawTerminal,
drawWeather, drawPhotos, drawClock, drawMail —
each is fifty to a hundred lines of native 2D canvas. No frameworks, no DOM.
Why CanvasTexture and not iframes? An iframe is a separate document that you cannot render into a WebGL texture without painful security plumbing. A 2D canvas is just pixels we own, painted however we like, and uploaded to the GPU once per frame. For mock content, it wins by a mile.
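The per-frame repaint this implies can be sketched as follows. It assumes the userData shape set up in createWindow above; the updateWindows name and the painted flag are illustrative and not taken from the demo source.

```javascript
// Repaint window canvases for one frame.
// Assumes each window's userData has { canvas, ctx, texture, draw, animated }.
// `painted` is a hypothetical flag added here so static windows draw only once.
function updateWindows(windows, timeSec) {
  let repainted = 0;
  for (const win of windows) {
    const data = win.userData;
    if (!data.animated && data.painted) continue; // static: already drawn
    data.draw(data.ctx, data.canvas.width, data.canvas.height, timeSec, data);
    data.texture.needsUpdate = true; // ask three.js to re-upload the canvas
    data.painted = true;
    repainted++;
  }
  return repainted; // how many canvases were repainted this frame
}
```

Call it once per animation frame before renderer.render. Skipping static windows keeps texture uploads proportional to the number of animated windows rather than the total.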
Drawing a rounded glass window background
function roundRect(ctx, x, y, w, h, r) {
ctx.beginPath();
ctx.moveTo(x + r, y);
ctx.arcTo(x + w, y, x + w, y + h, r);
ctx.arcTo(x + w, y + h, x, y + h, r);
ctx.arcTo(x, y + h, x, y, r);
ctx.arcTo(x, y, x + w, y, r);
ctx.closePath();
}
function drawFrame(ctx, w, h, accent) {
ctx.clearRect(0, 0, w, h);
// Glassy body
const grad = ctx.createLinearGradient(0, 0, 0, h);
grad.addColorStop(0, 'rgba(28, 24, 44, 0.92)');
grad.addColorStop(1, 'rgba(15, 12, 26, 0.92)');
ctx.fillStyle = grad;
roundRect(ctx, 0, 0, w, h, 36);
ctx.fill();
// Top hairline highlight = "glass edge"
ctx.strokeStyle = 'rgba(255, 255, 255, 0.12)';
ctx.lineWidth = 2;
ctx.stroke();
// Title bar
ctx.fillStyle = 'rgba(255, 255, 255, 0.04)';
roundRect(ctx, 0, 0, w, 70, 36);
ctx.fill();
// Three traffic lights
['#ff5f57', '#febc2e', '#28c840'].forEach((c, i) => {
ctx.fillStyle = c;
ctx.beginPath();
ctx.arc(36 + i * 28, 36, 10, 0, Math.PI * 2);
ctx.fill();
});
// Accent pill (top right) — gives the window its identity colour
ctx.fillStyle = accent;
ctx.fillRect(w - 96, 28, 52, 4);
}
Every drawX(ctx,...) calls drawFrame first, then paints its
unique content into the body region below the title bar.
STEP 04 Loading MediaPipe: the whole ML model in one CDN line
MediaPipe Tasks Vision ships a HandLandmarker that runs entirely in your
browser. No server. No upload. The model is ~10 MB and gets cached by the browser after
the first load.
// Click handler on the Start Camera button:
async function startCamera() {
// Dynamic import — MediaPipe is heavy, don't pay for it until the user opts in.
// Pin an exact tasks-vision release instead of @latest for production builds.
const mp = await import(
'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/vision_bundle.mjs'
);
const vision = await mp.FilesetResolver.forVisionTasks(
'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm'
);
handLandmarker = await mp.HandLandmarker.createFromOptions(vision, {
baseOptions: {
modelAssetPath:
'https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/1/hand_landmarker.task',
delegate: 'GPU', // 2–3× faster than CPU on most hardware
},
runningMode: 'VIDEO',
numHands: 1,
minHandDetectionConfidence: 0.5,
minHandPresenceConfidence: 0.5,
minTrackingConfidence: 0.5,
});
// Now grab the webcam.
const stream = await navigator.mediaDevices.getUserMedia({
video: { width: 640, height: 480, facingMode: 'user' },
audio: false,
});
videoEl.srcObject = stream;
await videoEl.play();
detectLoop();
}
Notice numHands: 1. We only need one. The model can do two but each one
costs another ~5 ms per frame.
The detection loop itself is one line plus a guard:
function detectLoop() {
if (videoEl.readyState >= 2) {
const result = handLandmarker.detectForVideo(videoEl, performance.now());
if (result.landmarks.length > 0) {
onHand(result.landmarks[0]); // 21 points
} else {
onNoHand();
}
}
requestAnimationFrame(detectLoop);
}
STEP 05 From landmark to 3D cursor
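MediaPipe gives you each landmark as { x, y, z } in image-normalized
coordinates. x goes right and y goes down, with the origin at the top-left and both in [0, 1].
z is relative depth measured from the wrist (smaller = closer to camera), so it can be negative and is not confined to [0, 1].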
Three.js wants world coordinates where y goes up and the visible area is
bounded by the camera's frustum. Two conversions to remember:
- Mirror x. The webcam image isn't mirrored by default, but humans expect their right hand to appear on the right. So we flip: worldX = (1 - 2·lm.x) · halfFrustumWidth.
- Flip y. worldY = (1 - 2·lm.y) · halfFrustumHeight.
The frustum size at z = 0 for a perspective camera at distance
d with vertical FOV fov:
function frustumAt(z) {
const d = camera.position.z - z;
const h = 2 * d * Math.tan((camera.fov * Math.PI / 180) / 2);
const w = h * camera.aspect;
return { w, h };
}
function handToWorld(lm8) {
const { w, h } = frustumAt(0);
return new THREE.Vector3(
(1 - 2 * lm8.x) * w / 2,
(1 - 2 * lm8.y) * h / 2,
0, // depth is derived from hand size (landmarks 0 and 5); left at 0 here
);
}
That's the entire bridge. Hand goes in, world position comes out.
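The depth estimate mentioned earlier (hand size as a proxy for distance) is not shown in the article's snippets, so here is one way it might look. This is a sketch under assumptions: the calibration numbers and both function names are hypothetical, and the 2D wrist-to-knuckle span also shrinks when the hand tilts, so treat the result as approximate.

```javascript
// Hypothetical calibration values: tune them for your camera and seating distance.
const DEPTH_DEFAULTS = {
  farSize: 0.10,  // wrist-to-knuckle span (normalized) when the hand is far away
  nearSize: 0.30, // same span when the hand is close to the lens
  zFar: -0.4,     // world z assigned to a far hand
  zNear: 0.8,     // world z assigned to a near hand
};

// Apparent hand size in the image: wrist (0) to index MCP (5), x/y only.
function handSize2D(landmarks) {
  const wrist = landmarks[0], mcp = landmarks[5];
  return Math.hypot(wrist.x - mcp.x, wrist.y - mcp.y);
}

// Linear remap from apparent size to world z, clamped to [zFar, zNear].
function depthFromHandSize(size, opts = DEPTH_DEFAULTS) {
  const { farSize, nearSize, zFar, zNear } = opts;
  const t = Math.min(1, Math.max(0, (size - farSize) / (nearSize - farSize)));
  return zFar + t * (zNear - zFar);
}
```

The raw value jitters like any other landmark-derived number, so it would normally pass through its own smoothing filter before being written to the cursor's z.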
STEP 06 One-Euro filter: the secret to "feels real"
Raw MediaPipe output jitters by 2–3 pixels per frame even when your hand is dead still.
Drop it on a 3D cursor and the cursor looks drunk. Naive lerp (prev = prev + (raw - prev) * 0.3)
fixes jitter but introduces lag — fast motion feels rubbery.
The One-Euro filter (Casiez et al., 2012) is the canonical fix. It uses a low-pass that adapts its cutoff based on velocity: still hand = heavy smoothing, fast hand = barely any. Forty lines, zero dependencies, transforms the entire feel of the demo.
class OneEuroFilter {
constructor(minCutoff = 1.0, beta = 0.0, dCutoff = 1.0) {
this.minCutoff = minCutoff;
this.beta = beta;
this.dCutoff = dCutoff;
this.x = null;
this.dx = 0;
this.lastTime = null;
}
filter(value, time) {
if (this.lastTime === null) {
this.x = value; this.lastTime = time; return value;
}
const dt = Math.max((time - this.lastTime) / 1000, 1e-6);
this.lastTime = time;
const dxRaw = (value - this.x) / dt;
const aD = this.alpha(dt, this.dCutoff);
this.dx = aD * dxRaw + (1 - aD) * this.dx;
const cutoff = this.minCutoff + this.beta * Math.abs(this.dx);
const a = this.alpha(dt, cutoff);
this.x = a * value + (1 - a) * this.x;
return this.x;
}
alpha(dt, cutoff) {
const r = 2 * Math.PI * cutoff * dt;
return r / (r + 1);
}
}
// One filter per axis
const fx = new OneEuroFilter(1.4, 0.015);
const fy = new OneEuroFilter(1.4, 0.015);
const fz = new OneEuroFilter(1.2, 0.010);
Tune minCutoff to control "still hand" smoothness (lower = smoother).
Tune beta to control "fast hand" responsiveness (higher = snappier).
Our defaults (1.4 / 0.015) are intentionally smooth — you can crank
beta to 0.05 for snappier flicks.
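To get a feel for what the two knobs do, it can help to print the smoothing factor the filter would use at a few speeds. The sketch below reuses the same alpha formula as the class above; the speeds are illustrative values in whatever units the filtered signal uses, not measurements from the demo.

```javascript
// Smoothing factor of the low-pass stage, same formula as OneEuroFilter.alpha.
function smoothingAlpha(dt, cutoff) {
  const r = 2 * Math.PI * cutoff * dt;
  return r / (r + 1);
}

// Effective cutoff for a given (already smoothed) speed.
function effectiveCutoff(minCutoff, beta, speed) {
  return minCutoff + beta * Math.abs(speed);
}

const dt = 1 / 60; // one frame at 60 fps
for (const speed of [0, 10, 100]) { // illustrative speeds, units per second
  const smooth = smoothingAlpha(dt, effectiveCutoff(1.4, 0.015, speed));
  const snappy = smoothingAlpha(dt, effectiveCutoff(1.4, 0.05, speed));
  console.log(speed, smooth.toFixed(3), snappy.toFixed(3));
}
```

An alpha near 0 means the output mostly keeps its previous value (heavy smoothing); an alpha near 1 means the raw sample passes through almost untouched. At speed 0 both settings give the same alpha, because beta only matters once the signal moves.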
STEP 07 Pinch detection: the gesture that earns the demo
The whole interaction model rests on one number: the 3D distance between landmark
4 (thumb tip) and landmark 8 (index tip), normalized by hand
size so it works regardless of how close your hand is to the camera.
function pinchAmount(landmarks) {
const a = landmarks[4]; // thumb tip
const b = landmarks[8]; // index tip
const dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
const raw = Math.sqrt(dx*dx + dy*dy + dz*dz);
// Normalize by hand size = distance from wrist (0) to index MCP (5)
const wrist = landmarks[0], mcp = landmarks[5];
const handSize = Math.hypot(wrist.x - mcp.x, wrist.y - mcp.y, wrist.z - mcp.z);
return raw / handSize; // 0 = fingers touching, ~1 = open hand
}
We don't want a binary on/off — we want hysteresis. A clean pinch fires when the ratio
drops below 0.30. It releases when it climbs back above 0.45.
The gap between thresholds is what stops jittery on/off flickering when your fingers are
hovering near the threshold:
let pinching = false;
function updatePinch(amount) {
if (!pinching && amount < 0.30) pinching = true;
else if (pinching && amount > 0.45) pinching = false;
return pinching;
}
The hysteresis trick is everywhere in UX. Mouse drag thresholds, double-click windows, gamepad trigger pull — anywhere a continuous input drives a discrete state, use two thresholds with a gap. Otherwise the boundary jitters and the interaction feels broken.
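The module-level pinching flag above is enough for one hand. Once a second hand joins in (as in the two-hand resize later in this article), each hand needs its own state. A minimal sketch of the same hysteresis as a factory, with createPinchDetector being an illustrative name:

```javascript
// Factory so each hand keeps its own pinch state.
// Thresholds default to the values used above (0.30 to engage, 0.45 to release).
function createPinchDetector(engage = 0.30, release = 0.45) {
  let pinching = false;
  return function update(amount) {
    if (!pinching && amount < engage) pinching = true;
    else if (pinching && amount > release) pinching = false;
    return pinching;
  };
}

const hand1 = createPinchDetector();
const trace = [0.50, 0.32, 0.29, 0.35, 0.44, 0.46, 0.31, 0.29].map(hand1);
// trace → [false, false, true, true, true, false, false, true]
```

Note how 0.35 and 0.44 keep the pinch alive and 0.31 does not re-engage it: values inside the gap never change the state, which is exactly what removes the flicker.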
STEP 08 Hover, grab, drag: the state machine
Six states. One window pointer.
| State | Hand 1 | Hand 2 | What happens |
|---|---|---|---|
| IDLE | open | — | Closest window scales to 1.22×, soft halo. Others rest. |
| GRAB | closed (just transitioned) | — | Particle burst. Window locks to hand 1, render-order maxed. homePos.z bumps above all others (rerank). |
| DRAG | closed (held + moved) | — | Window XY follows hand 1, smoothed. |
| SCALE | closed | closed | Window resizes to baseScale × (interHandDist / startDist), clamped [0.4, 3.0]. Position follows midpoint of both hands. |
| OPEN | quick pinch + release (no drift) | — | Air-tap. Window centres + zooms to 2.4×, others dim to 30% and push back in z. ESC or a second air-tap closes. |
| RELEASE | open | any | Drag commit — window stays at new XY/scale, with a small "release" burst. |
Closest-window selection is a simple linear scan — eight windows, no spatial index needed.
function findHovered(cursor) {
let best = null, bestDist = SELECT_RADIUS;
for (const w of windows) {
const d = Math.hypot(w.position.x - cursor.x, w.position.y - cursor.y);
if (d < bestDist) { best = w; bestDist = d; }
}
return best;
}
The grab transition fires exactly once — on the frame pinch goes from open to closed. That's where you spawn the particle burst:
const wasPinching = pinching;
pinching = updatePinch(pinchAmount(lm));
if (!wasPinching && pinching && hovered) {
grabbed = hovered;
grabOffset.subVectors(grabbed.position, cursor);
spawnBurst(cursor, hovered.userData.accent);
} else if (wasPinching && !pinching) {
grabbed = null;
}
if (grabbed) {
// Drag — follow cursor, keep z forward
grabbed.userData.targetPos.set(
cursor.x + grabOffset.x,
cursor.y + grabOffset.y,
0.6,
);
}
Notice targetPos — every window animates to a target each frame
(position.lerp(targetPos, 0.18)). Direct assignment would snap and feel rigid.
Lerping at 0.18 is the sweet spot: snappy enough to feel responsive, smooth
enough to look natural.
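One caveat: a fixed per-frame factor behaves differently at 30 fps than at 120 fps. If that matters, a common remedy is to convert the factor using the frame time. The sketch below assumes the 0.18 was tuned at 60 fps; dampFactor and approach are illustrative names.

```javascript
// Convert a per-frame lerp factor tuned at a reference frame rate into one
// that gives the same feel at any frame time dt (seconds).
function dampFactor(perFrame, dt, referenceFps = 60) {
  return 1 - Math.pow(1 - perFrame, dt * referenceFps);
}

// Move `current` toward `target` by that factor (plain numbers for brevity).
function approach(current, target, perFrame, dt) {
  return current + (target - current) * dampFactor(perFrame, dt);
}
```

With three.js vectors the call would look like position.lerp(targetPos, dampFactor(0.18, dt)). Two steps at 1/60 s then land in the same place as one step at 1/30 s.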
Air-tap → open / close
The same pinch gesture has to mean two different things: a slow pinch (held + moved) is a drag, and a quick pinch (closed and reopened in under ~280 ms without moving) is an air-tap. The classifier is two numbers:
const AIR_TAP_MAX_MS = 280;
const AIR_TAP_MAX_DRIFT = 0.10; // world units
// On pinch-down:
pinchStartTime = performance.now();
pinchStartPos.copy(cursor);
pinchMaxDrift = 0;
// Every frame while pinching:
if (pinching) {
pinchMaxDrift = Math.max(pinchMaxDrift, cursor.distanceTo(pinchStartPos));
}
// On pinch-up:
const dur = performance.now() - pinchStartTime;
const wasAirTap = dur < AIR_TAP_MAX_MS && pinchMaxDrift < AIR_TAP_MAX_DRIFT;
If wasAirTap is true, we don't commit a drag — we toggle the window's
openState. If the air-tap landed on empty space (no window hovered), we
use it to close whatever's currently open instead. Same gesture, dual purpose, no
button needed.
if (wasAirTap) {
if (grabbed) toggleOpen(grabbed);
else if (openedWindow) toggleOpen(openedWindow); // air-tap empty = close
} else if (grabbed) {
// regular release — commit the new position
}
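The release logic is easier to test when it is pulled out into pure functions. A sketch under the same thresholds as above; classifyPinch and resolveRelease are illustrative names, and the returned action strings are hypothetical:

```javascript
const AIR_TAP_MAX_MS = 280;
const AIR_TAP_MAX_DRIFT = 0.10; // world units

// Returns 'air-tap' or 'drag' for a finished pinch.
function classifyPinch(durationMs, maxDrift) {
  const quick = durationMs < AIR_TAP_MAX_MS;
  const steady = maxDrift < AIR_TAP_MAX_DRIFT;
  return quick && steady ? 'air-tap' : 'drag';
}

// Decide what a release should do. `grabbed` and `openedWindow` are whatever
// objects your app uses to identify windows (null when there is none).
function resolveRelease(kind, grabbed, openedWindow) {
  if (kind === 'air-tap') {
    if (grabbed) return { action: 'toggle', target: grabbed };
    if (openedWindow) return { action: 'toggle', target: openedWindow };
    return { action: 'none', target: null };
  }
  return grabbed
    ? { action: 'commit', target: grabbed }
    : { action: 'none', target: null };
}
```

Both conditions must hold for an air-tap: a quick pinch that drifted counts as a short drag, and a steady pinch held too long counts as a grab that never moved.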
The open pose
"Open" isn't fullscreen — it's a focus mode. The opened window animates to
(0, 0, 0.6) at scale ~2.4×. Every other window dims to 30%
opacity and pushes back in z. The whole scene gives the focused app the spotlight,
Mission-Control style.
// Per window, per frame:
win.userData.openAmount += (win.userData.openTarget - win.userData.openAmount) * 0.12;
const openA = win.userData.openAmount;
// Blend home position with centred target as openAmount rises
win.userData.targetPos.set(
homeX * (1 - openA) + 0 * openA,
homeY * (1 - openA) + 0 * openA,
hoverZ * (1 - openA) + 0.6 * openA - otherOpenAmount * 0.3,
);
// Scale and opacity:
const openBump = 1 + openA * 1.4; // up to 2.4×
const backScale = 1 - otherOpenAmount * 0.25;
win.scale.setScalar(userScale * selBump * openBump * backScale);
win.material.opacity = 1 - otherOpenAmount * 0.7;
otherOpenAmount answers the question "how open is any other window?". As one
window opens, the others dim and shrink proportionally, so a single number drives the whole
background fade. Press ESC or air-tap anywhere to close.
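The snippet above does not show where otherOpenAmount comes from. One plausible reading is "the largest openAmount among all the other windows"; the demo source may compute it differently. A sketch of that reading, with computeOtherOpenAmounts as an illustrative name:

```javascript
// For each window, the largest openAmount among all the *other* windows.
// Input: an array of numeric openAmount values in [0, 1], one per window.
function computeOtherOpenAmounts(openAmounts) {
  // Track the largest and second-largest values so the pass stays O(n).
  let max = 0, second = 0, maxIndex = -1;
  openAmounts.forEach((a, i) => {
    if (a > max) { second = max; max = a; maxIndex = i; }
    else if (a > second) { second = a; }
  });
  // The most-open window sees the runner-up; everyone else sees the maximum.
  return openAmounts.map((_, i) => (i === maxIndex ? second : max));
}
```

With eight windows a nested loop would be just as fine; the point is that the opened window itself must be excluded, otherwise it would dim along with the rest.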
Two-hand pinch-spread → scale
The same model is doing all the work. Flip numHands to 2, give
each hand its own One-Euro filters + cursor visuals, and you have everything you need for
a Vision-Pro-style resize gesture.
// In the MediaPipe options:
numHands: 2,
// Process both detected hands per frame:
if (result.landmarks.length > 0) onHand(result.landmarks[0]);
if (result.landmarks.length > 1) onHand2(result.landmarks[1]);
The scale state machine is tiny. Enter when hand 1 is already grabbing AND hand 2 starts pinching. Record the inter-hand distance + the window's current scale. While both hands stay pinched, update the scale by the distance ratio:
let scaling = false;
const scaleStart = { dist: 1, baseUserScale: 1 };
const canScale = grabbed && hand2Visible && pinching && pinching2;
if (canScale && !scaling) {
scaling = true;
scaleStart.dist = Math.max(cursor.distanceTo(cursor2), 0.05);
scaleStart.baseUserScale = grabbed.userData.userScale || 1;
spawnBurst(cursor2, grabbed.userData.type.accent, 24);
} else if (scaling && !canScale) {
scaling = false;
}
if (scaling) {
const dist = Math.max(cursor.distanceTo(cursor2), 0.05);
const ratio = dist / scaleStart.dist;
const target = THREE.MathUtils.clamp(
scaleStart.baseUserScale * ratio, 0.4, 3.0,
);
const cur = grabbed.userData.userScale ?? 1;
grabbed.userData.userScale = cur + (target - cur) * 0.25;
// Window centres between the two pinch points — Vision-Pro feel
const mid = new THREE.Vector3().addVectors(cursor, cursor2).multiplyScalar(0.5);
grabbed.userData.targetPos.x = mid.x;
grabbed.userData.targetPos.y = mid.y;
}
The smoothing step (cur + (target - cur) * 0.25) is what stops the window
from popping every frame. Without it, even a sub-millimetre jitter on either fingertip
shows up as a visible scale-pulse. With it, slow movements glide and fast movements feel
snappy. The clamp [0.4, 3.0] stops users from shrinking a window into oblivion
or growing it past the viewport.
Mouse + touchpad: scroll wheel = scale
Not every viewer will use two hands. Same code path, different input:
demoEl.addEventListener('wheel', (e) => {
const target = grabbed || hovered;
if (!target) return;
e.preventDefault();
// exp(-deltaY * k) gives symmetric zoom-in / zoom-out
const factor = Math.exp(-e.deltaY * 0.0015);
const cur = target.userData.userScale ?? 1;
target.userData.userScale = THREE.MathUtils.clamp(cur * factor, 0.4, 3.0);
}, { passive: false });
Why an exponential factor? Additive scaling (scale += deltaY * k) feels uneven, because the same delta represents a bigger ratio when the value is small. A linear multiplicative factor (scale *= 1 - deltaY * k) is asymmetric: scrolling up by N then down by N leaves you at 1 - (N·k)² of the starting scale instead of back where you began. exp(-deltaY * k) makes "scroll up 100" and "scroll down 100" exactly cancel, as long as the clamp isn't hit. Same trick map apps use for pinch-zoom.
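The cancellation is easy to check numerically. The sketch below compares the exponential factor with a linear multiplicative one (scale *= 1 - deltaY * k), which is introduced here only for comparison:

```javascript
const K = 0.0015; // same sensitivity constant as the wheel handler above

const expFactor = (deltaY) => Math.exp(-deltaY * K);
const linearFactor = (deltaY) => 1 - deltaY * K;

// Scroll up by 100, then down by 100, starting from scale 1.
const expRoundTrip = 1 * expFactor(-100) * expFactor(100);
const linearRoundTrip = 1 * linearFactor(-100) * linearFactor(100);
// expRoundTrip is 1 up to floating-point rounding;
// linearRoundTrip is about 0.9775, so every round trip shrinks the window.
```

The linear loss is (N·k)² per round trip, which looks small but compounds as the user scrolls back and forth.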
STEP 09 Visual wow: glow halos and particle bursts
The interaction logic is done. Everything from here on is what makes the demo screenshot-worthy.
Soft halo behind each window
A second plane, slightly larger than the window, behind it. It carries a radial-gradient
canvas texture tinted to the window's accent colour. Material is
AdditiveBlending, opacity tied to the window's "selection" value (0 → 1).
function makeHaloTexture() {
const c = document.createElement('canvas');
c.width = c.height = 256;
const ctx = c.getContext('2d');
const grad = ctx.createRadialGradient(128,128,0,128,128,128);
grad.addColorStop(0, 'rgba(255,255,255,1)');
grad.addColorStop(0.4, 'rgba(255,255,255,0.3)');
grad.addColorStop(1, 'rgba(255,255,255,0)');
ctx.fillStyle = grad;
ctx.fillRect(0,0,256,256);
return new THREE.CanvasTexture(c);
}
Reuse the same texture across all eight halos. Cheap. Looks expensive.
The hand cursor itself
A small sphere with an emissive material plus its own halo plus a trail of fading
spheres behind it. The cursor's colour shifts with state — cyan when idle, white when
hovering a window, hot orange when pinched. Each shift is a colour lerp
toward the target — never a snap.
const idleColor = new THREE.Color(0x22d3ee);
const hoverColor = new THREE.Color(0xffffff);
const grabColor = new THREE.Color(0xff8a3d);
const target = pinching ? grabColor : (hovered ? hoverColor : idleColor);
cursor.material.color.lerp(target, 0.18);
cursorHalo.material.color.lerp(target, 0.18);
The particle burst
A single THREE.Points object. Pre-allocate the worst-case number of
particles. Keep a pool. On grab, activate N of them at the cursor with
random outward velocities. Each frame, advance position, fade size + opacity. When a
particle's life hits zero, return it to the pool.
const MAX = 200;
const positions = new Float32Array(MAX * 3);
const velocities = new Float32Array(MAX * 3);
const lives = new Float32Array(MAX);
const sizes = new Float32Array(MAX);
const colors = new Float32Array(MAX * 3);
// On grab:
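// count is optional so callers can ask for a smaller burst, e.g. spawnBurst(at, color, 24)
function spawnBurst(at, color, count = 40) {
for (let n = 0; n < count; n++) {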
const i = findDead();
if (i === -1) break;
positions[i*3+0] = at.x;
positions[i*3+1] = at.y;
positions[i*3+2] = at.z;
const theta = Math.random() * Math.PI * 2;
const speed = 0.5 + Math.random() * 1.5;
velocities[i*3+0] = Math.cos(theta) * speed;
velocities[i*3+1] = Math.sin(theta) * speed;
velocities[i*3+2] = (Math.random() - 0.5) * 0.6;
lives[i] = 0.6 + Math.random() * 0.4;
// ...colour into colors[i*3+...]
}
}
The "wow" of the burst comes from three things: the count (40 is enough), the speed spread (some fast, some slow, which makes all the difference), and the colour matching the window you grabbed.
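The snippet above relies on a findDead helper and a per-frame update that are not shown. A self-contained sketch of both follows. createParticlePool, its method names, and the drag constant are illustrative; the demo keeps its arrays at module scope instead.

```javascript
// Fixed-size particle pool backed by typed arrays.
function createParticlePool(max = 200) {
  const positions = new Float32Array(max * 3);
  const velocities = new Float32Array(max * 3);
  const lives = new Float32Array(max); // seconds remaining; <= 0 means free

  // Linear scan for a free slot; fine for a few hundred particles.
  function findDead() {
    for (let i = 0; i < max; i++) if (lives[i] <= 0) return i;
    return -1;
  }

  function spawn(x, y, z, vx, vy, vz, life) {
    const i = findDead();
    if (i === -1) return -1; // pool exhausted: drop the particle
    positions.set([x, y, z], i * 3);
    velocities.set([vx, vy, vz], i * 3);
    lives[i] = life;
    return i;
  }

  // Advance live particles by dt seconds; returns how many are still alive.
  function update(dt, drag = 0.96) {
    let alive = 0;
    for (let i = 0; i < max; i++) {
      if (lives[i] <= 0) continue;
      lives[i] -= dt;
      if (lives[i] <= 0) continue; // expired this frame: slot is free again
      for (let a = 0; a < 3; a++) {
        positions[i * 3 + a] += velocities[i * 3 + a] * dt;
        velocities[i * 3 + a] *= drag; // hypothetical per-frame drag
      }
      alive++;
    }
    return alive;
  }

  return { positions, lives, findDead, spawn, update };
}
```

When the positions array backs a THREE.Points geometry, set geometry.attributes.position.needsUpdate = true after each update so the new values reach the GPU.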
STEP 10 Mouse fallback & ship it
Not every viewer will enable their camera. The demo should still work. Map mouse position to the same cursor target. Left mouse button = pinch. Done. Same state machine, same effects.
demo.addEventListener('pointermove', (e) => {
if (cameraActive) return; // hand-tracker takes over once camera starts
const r = demo.getBoundingClientRect();
const nx = (e.clientX - r.left) / r.width;
const ny = (e.clientY - r.top) / r.height;
const { w, h } = frustumAt(0);
rawCursor.set(
(nx * 2 - 1) * w / 2,
(1 - ny * 2) * h / 2,
0,
);
});
demo.addEventListener('pointerdown', () => { mousePinching = true; });
demo.addEventListener('pointerup', () => { mousePinching = false; });
Mouse mode is on by default. Camera takes over when you click Start Camera.
Performance notes & gotchas
- HTTPS required for getUserMedia. Localhost is exempt. If your hosted demo doesn't get a camera prompt, that's the cause 9 times out of 10.
- Inference cost. Hand Landmarker is ~8–15 ms/frame on integrated GPU, ~25–40 ms on CPU. Always use delegate: 'GPU'.
- Detect every frame. Don't try to "save cycles" by running inference every other frame. The latency cost on a 60Hz display is more noticeable than the FPS gain.
- Texture updates. Animated window content sets texture.needsUpdate = true every frame. Static windows draw once and never touch the texture again.
- Mirror the video element with CSS, not WebGL. transform: scaleX(-1) on the <video> in the mini-camera. The landmarks themselves stay unmirrored; we mirror in the world-coords math (step 5).
- Dispose on unmount. Single-page demos leak GPU memory if you navigate away without calling geometry.dispose() / material.dispose() / renderer.dispose(). We do it in the demo's cleanup listener.
- Hand jitter when occluded. If your hand goes off-screen, the model still guesses for a few frames. Use the onNoHand branch to decay the cursor toward idle and release any grab.
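For the last point, releasing a grab on the very first missed frame can feel twitchy, so a short grace period may help. This is a sketch, not the demo's code: createHandLossTracker and the 150 ms default are assumptions.

```javascript
// Tracks how long the hand has been missing and reports when to let go.
// graceMs is a hypothetical tolerance for brief tracking dropouts.
function createHandLossTracker(graceMs = 150) {
  let lostSince = null;
  return {
    // Call from onHand: the hand is visible again, reset the timer.
    seen() { lostSince = null; },
    // Call from onNoHand; returns true once the grab should be released.
    missing(nowMs) {
      if (lostSince === null) lostSince = nowMs;
      return nowMs - lostSince >= graceMs;
    },
  };
}
```

Inside onNoHand you would call missing(performance.now()) and, once it returns true, clear the grabbed window and start decaying the cursor toward its idle state.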
Exercises
- Rotate with two hands. Scale is the magnitude change between the two pinch points. Rotation is the angle change. Capture atan2 of the vector between the two cursors at scale-start, and apply the delta as a Z rotation on the grabbed window. Now you have a full manipulate gesture (move + scale + rotate).
- Window stack. Sort windows by z when one's grabbed, then fan them out into a Mac Mission Control–style layout.
- Real iframe content. Replace one CanvasTexture window with a CSS3DRenderer iframe so it shows an actual page. Bonus: pinch-and-pan that iframe.
- Voice commands. Wire SpeechRecognition alongside hand tracking. Say "expand" while pinching → window fills the screen.
- Persistence. Save window positions to localStorage so layouts survive a refresh.
- Air-tap. Add a quick-pinch gesture (open → closed → open within 250ms) that fires a click instead of a grab. Now you have buttons.
What's next
Project P14 is intentionally single-page. Plug the same hand-tracking core into:
- S10-06 — MediaPipe Tracking — the concept reference
- S3-09 — Facial Capture — same MediaPipe stack, faces instead of hands