Three.js From Zero · Article S3-07

Motion Matching & Root Motion

Blend trees (S3-03) work but have limits. You hand-author axes, hand-place anchors, hand-tune transitions. It scales to walk/run. It doesn't scale to a huge library of contextual movements — turn-in-place, jump-over-waist-high-wall, vault, land-from-10m.

Motion matching replaces blend trees with a nearest-neighbor search run every 6-10 frames. You have a database of every animation clip, sliced into tiny windows. At each search, the system looks at the character's current velocity, future desired trajectory, and body pose — and picks the clip frame that matches best. No state machine. No blend tree. Just "what clip in my library looks most like where I am right now and where I'm headed?"

The demo shows motion matching on a 2D dot that moves toward wherever you click. A library of ~12 synthetic "clips" with varied trajectories. Each tick, the system scores every clip against the current desired trajectory. Best match plays. Click different spots and watch the chosen clip change.

The history: how we got here

Motion matching hit the AAA mainstream around 2016 (Ubisoft's "For Honor", then EA's "FIFA", then Horizon Zero Dawn). The canonical reference is Simon Clavet's GDC 2016 talk, Motion Matching and The Road to Next-Gen Animation. Before that: blend trees + state machines + thousands of hand-tuned transitions. After: you dump all your mocap into one big pile and let the search figure it out.

The pose-to-pose concept

At a high level, motion matching compares two feature vectors:

  • Current state — what the character is doing now + wants to do next
  • Every clip frame — what a clip looks like at each time point

The clip frame whose feature vector is closest (lowest weighted distance) to the current state becomes the new playing clip. The character cuts to that clip, continues playing from that point, and the cycle repeats.

The feature vector

Classic motion-matching features (from the Clavet paper):

Feature | What | Weight
Future trajectory | Root position at T+0.2s, T+0.4s, T+0.6s (in character's local frame) | High
Future orientation | Facing direction at the same intervals | High
Root velocity | Current linear velocity | Medium
Foot positions | Left foot + right foot relative to root | Low-Medium (continuity)
Foot velocities | Linear velocity of each foot | Medium (avoid mid-step jumps)

Concatenate into one big float vector (say 24 floats). Every frame of every clip gets a feature vector too — computed offline. At runtime: distance check between current state feature and every candidate frame's feature.
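
A minimal sketch of what that assembly could look like for a 2D case (the layout, buildFeatures, and the WEIGHTS values are illustrative choices, not a spec — a 3D rig adds the extra axes):

// Illustrative feature layout — 2D for brevity (22 floats).
// traj: three future samples {pos, dir} in the character's local frame,
// rootVel: current root velocity, feet: left/right foot pos + vel relative to root.
function buildFeatures(traj, rootVel, feet) {
  return [
    traj[0].pos.x, traj[0].pos.y, traj[1].pos.x, traj[1].pos.y, traj[2].pos.x, traj[2].pos.y,
    traj[0].dir.x, traj[0].dir.y, traj[1].dir.x, traj[1].dir.y, traj[2].dir.x, traj[2].dir.y,
    rootVel.x, rootVel.y,
    feet.left.pos.x, feet.left.pos.y, feet.right.pos.x, feet.right.pos.y,
    feet.left.vel.x, feet.left.vel.y, feet.right.vel.x, feet.right.vel.y,
  ];
}

// Per-dimension weights mirroring the table: trajectory/orientation high, feet lower.
const WEIGHTS = [
  ...Array(6).fill(1.0),   // future trajectory
  ...Array(6).fill(1.0),   // future orientation
  ...Array(2).fill(0.6),   // root velocity
  ...Array(4).fill(0.3),   // foot positions
  ...Array(4).fill(0.5),   // foot velocities
];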

The search — nearest neighbor

// Brute-force nearest neighbor: score every frame in the database, keep the closest.
function search(currentFeatures, clipDB, weights) {
  let best = null, bestDist = Infinity;
  for (const frame of clipDB) {
    const dist = weightedDistance(currentFeatures, frame.features, weights);
    if (dist < bestDist) { bestDist = dist; best = frame; }
  }
  return best;
}

// Squared Euclidean distance with a per-dimension weight.
function weightedDistance(a, b, w) {
  let s = 0;
  for (let i = 0; i < a.length; i++) {
    s += w[i] * (a[i] - b[i]) ** 2;
  }
  return s;
}

O(N × D) where N = clip frames, D = feature dimension. For 10,000 frames × 24 dimensions at 10Hz search rate → 2.4M ops per second. Cheap. No need for KD-trees unless your DB is millions of frames.

"Desired trajectory" — where are we going?

The trajectory feature compares clip frames against where the player wants to be, not where the character currently is. You need to predict or specify the player's trajectory 0.6s into the future.

// Tunable constants (example values): top speed and smoothing time constant.
const MAX_SPEED = 4;       // m/s
const HALF_LIFE = 0.15;    // s — smaller = snappier velocity response

function predictTrajectory(currentPos, currentVel, inputStick, dt) {
  // Exponential damper — current velocity eases toward the stick's desired velocity
  const targetVel = inputStick.clone().multiplyScalar(MAX_SPEED);
  const blend = 1 - Math.exp(-dt / HALF_LIFE);
  const predictedVel = currentVel.clone().lerp(targetVel, blend);
  const predictedPos = currentPos.clone().addScaledVector(predictedVel, dt);
  return { pos: predictedPos, vel: predictedVel };
}

// Sample 3 future points by chaining the prediction
const t1 = predictTrajectory(pos, vel, input, 0.2);
const t2 = predictTrajectory(t1.pos, t1.vel, input, 0.2);
const t3 = predictTrajectory(t2.pos, t2.vel, input, 0.2);

This gives you "where I'll be in 0.2s, 0.4s, 0.6s if I keep pressing what I'm pressing". Clip frames encode the SAME thing — they know their own trajectory looking forward from that frame. Match against that.
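
One detail worth making explicit: the predicted points come out in world space, so they need to be rotated into the character's local frame before they go into the feature vector — otherwise a clip recorded walking north could never match a character currently facing east. A sketch for the 2D case, assuming a character position pos and a facing angle facingAngle (both names illustrative):

// Rotate a world-space point into the character's local frame.
function toLocalFrame(worldPoint, pos, theta) {
  const dx = worldPoint.x - pos.x;
  const dy = worldPoint.y - pos.y;
  const c = Math.cos(-theta), s = Math.sin(-theta);
  return { x: dx * c - dy * s, y: dx * s + dy * c };
}

const localTraj = [t1, t2, t3].map(t => toLocalFrame(t.pos, pos, facingAngle));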

Search frequency — throttle

You don't search every frame. Every 6-10 frames (about 10Hz) is plenty. At 10Hz:

  • The character has time to play a little bit of the chosen clip before being re-evaluated
  • Lower CPU cost
  • Smoother (no ping-ponging between near-tied candidates every frame)

Between searches, just advance the current clip normally. On a re-search, if the best match is within the currently playing clip at a similar frame, keep playing — don't cut. Only cut if the new best is meaningfully better or from a different clip.
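
A sketch of what that per-frame update could look like, reusing search() and cutTo() from this article (SEARCH_INTERVAL, CLIP_FPS, KEEP_PLAYING_WINDOW, currentClip/currentFrame, and buildCurrentFeatures are illustrative names):

// Assuming a 60fps update and a re-search every 6 frames (~10Hz).
const SEARCH_INTERVAL = 6;
let framesSinceSearch = 0;

function update(dt) {
  currentFrame += dt * CLIP_FPS;               // advance the playing clip normally
  framesSinceSearch++;

  if (framesSinceSearch >= SEARCH_INTERVAL) {
    framesSinceSearch = 0;
    const features = buildCurrentFeatures();   // trajectory + velocity + pose
    const best = search(features, clipDB, WEIGHTS);

    // Keep playing if the winner is "where we already are" — same clip, nearby frame.
    const sameSpot = best.clip === currentClip &&
                     Math.abs(best.frame - currentFrame) < KEEP_PLAYING_WINDOW;
    if (!sameSpot) cutTo(best.clip, best.frame);
  }
}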

Blending on switch — 0.2s crossfade

When the motion matcher decides to cut to a new clip, don't hard-cut. Crossfade over ~0.2s (the S1-05 pattern). The search already rewarded frames that continue the current motion, so the crossfades tend to be tiny and natural.

function cutTo(clip, frame) {
  const next = mixer.clipAction(clip);
  next.reset().time = frame * (1 / 30);   // seek to the matched frame (assumes 30fps clips)
  next.fadeIn(0.2).play();
  currentAction?.fadeOut(0.2);
  currentAction = next;
}

Root motion — the indispensable partner

Motion matching depends on root motion. Each clip frame carries "how far the character moves over the next 0.6s" — that's derived from root-motion data in the clip. Without root motion you can't compute a frame's forward trajectory, so there's nothing to compare against the player's predicted trajectory.

Workflow: export mocap with root motion baked in. Sample the root bone's position + orientation at regular intervals. Store per-frame velocity + future offsets in the feature vector.
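
A sketch of that offline pass, assuming a helper sampleRootPosition(clip, t) that evaluates the root bone's position track of a glTF clip at time t and returns a THREE.Vector3 (e.g. built on THREE.KeyframeTrack interpolants):

// For each clip, sample the root at fixed intervals and store per-frame velocity
// plus the future offsets at +0.2s / +0.4s / +0.6s used by the trajectory feature.
function extractClipFeatures(clip, fps = 30) {
  const frames = [];
  const frameCount = Math.floor(clip.duration * fps);
  for (let i = 0; i < frameCount; i++) {
    const t = i / fps;
    const p0 = sampleRootPosition(clip, t);
    const p1 = sampleRootPosition(clip, Math.min(t + 0.2, clip.duration));
    const p2 = sampleRootPosition(clip, Math.min(t + 0.4, clip.duration));
    const p3 = sampleRootPosition(clip, Math.min(t + 0.6, clip.duration));
    frames.push({
      clip, frame: i,
      futureOffsets: [p1.clone().sub(p0), p2.clone().sub(p0), p3.clone().sub(p0)],
      rootVelocity: p1.clone().sub(p0).divideScalar(0.2),   // approximate root velocity
    });
  }
  return frames;
}

A full version would also rotate the future offsets into the root's local frame at time t, using the root's orientation at that moment, so they match the local-frame trajectory computed at runtime.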

The motion library

For a production character you need:

  • Idle (~30s looping with small fidgets)
  • Walk cycle (forward, strafe, back — 8 directions)
  • Run cycle (8 directions)
  • Sprint (limited directions)
  • Stops and starts (acceleration/deceleration)
  • Turn-in-place (left/right 90° + 180°)
  • Jumps (takeoffs + lands at various heights)
  • Transitions between all of the above (mocap dense edges)

Total: 20-60 minutes of mocap. That sounds like a lot, but for a main character in a 60-hour game it's about a week of shooting — far less than hand-authoring every transition.

Contextual tags

Each clip can carry tags — "combat", "sneak", "injured", "swimming". The search filters to clips matching the current context before comparing features. That lets one database carry wildly different movement styles.
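
A sketch of that filter step, assuming each clip carries an illustrative tags array:

// Restrict the candidate set to the active context before running the search.
function searchWithContext(currentFeatures, clipDB, weights, activeTag) {
  const candidates = clipDB.filter(frame => frame.clip.tags.includes(activeTag));
  return search(currentFeatures, candidates, weights);
}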

Limitations + what replaces them

Limitation | Solution
Needs a big mocap library | Procedural generation on top of base clips (Ubisoft's "Ghost Recon" approach)
Doesn't automatically handle interaction with the environment | Layer IK on top — feet plant, hands reach
Can "miss" and play something weird | Tighten feature weights, add constraint filters
Memory — every clip frame × feature vector | Compress features (8-bit quantization), use delta encoding
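
For the memory row, one way the 8-bit quantization could look (a sketch, not a prescribed scheme): store each dimension's min/max once for the whole database, then pack every frame's features into a Uint8Array — 24 bytes per frame instead of 96 for 24 float32s.

// Quantize a feature vector to 8 bits per dimension.
function quantize(features, mins, maxs) {
  const out = new Uint8Array(features.length);
  for (let i = 0; i < features.length; i++) {
    const t = (features[i] - mins[i]) / (maxs[i] - mins[i] || 1);
    out[i] = Math.round(Math.min(1, Math.max(0, t)) * 255);
  }
  return out;
}

// Expand back to floats before (or during) the distance check.
function dequantize(q, mins, maxs) {
  const out = new Float32Array(q.length);
  for (let i = 0; i < q.length; i++) {
    out[i] = mins[i] + (q[i] / 255) * (maxs[i] - mins[i]);
  }
  return out;
}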

Libraries that do this for you

  • Unreal Engine 5 ships native motion matching (replacing hand-built state machines in many games)
  • Kinematica — Unity's experimental motion matching package (preview)
  • Learned Motion Matching (neural variant, Holden et al., 2020) — learns a compact model that replaces the explicit database search

Nothing ships in Three.js specifically. You'd roll your own, likely with offline feature extraction from glTF clips and a runtime search in JS or WASM.

Common first-time pitfalls

  • Character flickers between clips. Search rate too high, feature weights badly balanced. Drop to 5-10Hz, increase velocity/trajectory weights.
  • Character commits to a direction and then plays a mismatch. Trajectory prediction isn't updating fast enough. Re-predict each frame even if you don't re-search.
  • Transitions are choppy. Not crossfading on cut. Add 0.15-0.25s fadeIn.
  • Foot slides. Root motion doesn't match the actual locomotion. Pair with foot IK (S3-04).
  • Turns feel floaty. Orientation features underweighted. Increase the weight on future-facing direction.
  • Database too big / slow search. Tag-filter first, then search within the subset.

Exercises

  1. Implement a 2D toy version (like the demo): N clips, each a list of (position, velocity) at T+0, T+0.2, T+0.4. Control character with mouse, live-score all clips.
  2. Add tag filtering: "walking" vs "sprinting" clips. Flag a context switch via keyboard.
  3. Foot-IK pass: layer S3-04 foot IK on top so feet plant even when the matched clip has slight root drift.

What's next

S3-08 — Mixamo Pipeline. End-to-end workflow for getting a rigged character into Three.js: Mixamo auto-rig, clip retargeting, batch glTF conversion, scene integration.