Three.js From Zero · Article S3-09

Facial Capture with Webcam

Your webcam sees your face. MediaPipe (Google's ML framework) extracts 468 landmarks from the video feed — precise points on eyes, mouth, chin, forehead. Map those landmarks to ARKit blendshapes (from S3-02). Drive a 3D character's face in real time, all client-side, no servers, no apps.

This unlocks browser-based performance capture. VTubers, social avatars, virtual meetings with expressive avatars. The whole stack is free.

The demo runs live if you grant camera permission. On the left: webcam feed with landmarks drawn over your face. On the right: a procedural 3D face driven by your expression. Smile, blink, open your mouth.

What MediaPipe gives you

MediaPipe FaceLandmarker is Google's real-time face ML model, shipped for the web as a WASM package. Input: a video frame. Output per frame:

  • 468 face landmarks — 3D points (x, y, z normalized to image space)
  • 52 ARKit-compatible blendshape scores — 0..1 weights per shape (mouthSmileLeft, eyeBlinkRight, jawOpen, etc.)
  • Face transformation matrix — rigid head pose (rotation + translation)

The 52 blendshapes are the money feature. Google pre-trained a model that outputs ARKit-named weights directly — no need for you to interpret landmarks. Pipe those weights straight into mesh.morphTargetInfluences.

Loading the model

import { FilesetResolver, FaceLandmarker } from
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/vision_bundle.mjs';

// Resolve the WASM backend. In production, pin an exact version
// instead of @latest so the JS bundle and WASM files can't drift apart.
const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/wasm'
);

const faceLandmarker = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/face_landmarker/face_landmarker/float16/1/face_landmarker.task',
    delegate: 'GPU',   // prefer GPU inference; 'CPU' forces the WASM path
  },
  outputFaceBlendshapes: true,
  outputFacialTransformationMatrixes: true,
  runningMode: 'VIDEO',
  numFaces: 1,
});

The camera loop

const video = document.getElementById('video');
const stream = await navigator.mediaDevices.getUserMedia({
  video: { width: 640, height: 480 }, audio: false,
});
video.srcObject = stream;
await video.play();

function detectLoop() {
  const timeMs = performance.now();
  const result = faceLandmarker.detectForVideo(video, timeMs);

  if (result.faceBlendshapes.length > 0) {
    const shapes = result.faceBlendshapes[0].categories;
    // shapes is an array of { categoryName: 'jawOpen', score: 0.42 }
    applyToCharacter(shapes);
  }
  if (result.facialTransformationMatrixes.length > 0) {
    const m = result.facialTransformationMatrixes[0].data;
    applyHeadPose(m);
  }

  requestAnimationFrame(detectLoop);
}
detectLoop();
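
One refinement: detectForVideo expects monotonically increasing timestamps, and running inference when the video element hasn't produced a new frame wastes compute. A guard on video.currentTime skips duplicate frames. A minimal sketch:

let lastVideoTime = -1;

function detectLoop() {
  // Only run inference when the video has advanced to a new frame
  if (video.currentTime !== lastVideoTime) {
    lastVideoTime = video.currentTime;
    const result = faceLandmarker.detectForVideo(video, performance.now());
    // ...same blendshape / pose handling as above...
  }
  requestAnimationFrame(detectLoop);
}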

Applying blendshapes to Three.js

function applyToCharacter(shapes) {
  for (const s of shapes) {
    const idx = headMesh.morphTargetDictionary[s.categoryName];
    if (idx !== undefined) {
      // Smooth — exponential damper on the value
      const prev = headMesh.morphTargetInfluences[idx];
      headMesh.morphTargetInfluences[idx] = prev + (s.score - prev) * 0.5;
    }
  }
}

The smoothing is important — raw ML output jitters frame to frame, even when your face is still. A 0.5 lerp per frame is snappy but smooth; drop to 0.25 for more stability. One caveat: a fixed per-frame factor depends on frame rate. A time-based version follows.
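
A minimal sketch, where dt is seconds since the last frame and rate plays the role of the per-frame factor (higher is snappier, try 8 to 15):

function damp(current, target, rate, dt) {
  // 1 - e^(-rate * dt) converges the same amount per second at any fps
  return current + (target - current) * (1 - Math.exp(-rate * dt));
}

// in applyToCharacter:
// headMesh.morphTargetInfluences[idx] = damp(prev, s.score, 10, dt);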

The 52 shape names (same as ARKit)

browDownLeft, browDownRight, browInnerUp, browOuterUpLeft, browOuterUpRight,
cheekPuff, cheekSquintLeft, cheekSquintRight,
eyeBlinkLeft, eyeBlinkRight,
eyeLookDownLeft, eyeLookDownRight, eyeLookInLeft, eyeLookInRight,
eyeLookOutLeft, eyeLookOutRight, eyeLookUpLeft, eyeLookUpRight,
eyeSquintLeft, eyeSquintRight, eyeWideLeft, eyeWideRight,
jawForward, jawLeft, jawOpen, jawRight,
mouthClose, mouthDimpleLeft, mouthDimpleRight,
mouthFrownLeft, mouthFrownRight, mouthFunnel, mouthLeft, mouthLowerDownLeft,
mouthLowerDownRight, mouthPressLeft, mouthPressRight, mouthPucker, mouthRight,
mouthRollLower, mouthRollUpper, mouthShrugLower, mouthShrugUpper,
mouthSmileLeft, mouthSmileRight, mouthStretchLeft, mouthStretchRight,
mouthUpperUpLeft, mouthUpperUpRight,
noseSneerLeft, noseSneerRight, tongueOut

If your character has any of these, MediaPipe drives it. Missing shapes are silently skipped. Ready Player Me avatars ship all 52 — plug and play.
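
Finding the right mesh in a loaded avatar is a one-time traverse. A minimal sketch: the mesh name 'Wolf3D_Head' is what Ready Player Me uses at the time of writing, so verify it against your own asset.

import { GLTFLoader } from 'three/addons/loaders/GLTFLoader.js';

const gltf = await new GLTFLoader().loadAsync('/avatar.glb');
let headMesh = null;
gltf.scene.traverse((obj) => {
  // Any mesh with a morphTargetDictionary can be driven; RPM avatars
  // put the ARKit shapes on the mesh named 'Wolf3D_Head'
  if (obj.isMesh && obj.morphTargetDictionary && obj.name === 'Wolf3D_Head') {
    headMesh = obj;
  }
});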

Head pose — the rotation matrix

function applyHeadPose(matData) {
  // matData is a Float32Array(16) — column-major 4x4 matrix
  const mat = new THREE.Matrix4().fromArray(matData);

  // Decompose into pos / rot / scale
  const pos = new THREE.Vector3();
  const quat = new THREE.Quaternion();
  const scale = new THREE.Vector3();
  mat.decompose(pos, quat, scale);

  // Apply the rotation to the head bone — smooth it
  headBone.quaternion.slerp(quat, 0.2);
}

The matrix data is in MediaPipe's coordinate space — you may need to flip axes for your character (mirror left/right, negate Y for screen space). Test with head nods/shakes to calibrate.
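
A common correction mirrors yaw and roll so the avatar turns the way you see yourself in the mirrored video. Treat the sign flips in this sketch as a starting point, not gospel:

// Decompose into Euler angles, flip the mirrored axes, re-apply
const euler = new THREE.Euler().setFromQuaternion(quat, 'YXZ');
euler.y *= -1; // yaw  (left/right turn)
euler.z *= -1; // roll (head tilt)
const corrected = new THREE.Quaternion().setFromEuler(euler);
headBone.quaternion.slerp(corrected, 0.2);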

Neck flexion — not just head rotation

A natural head turn bends the neck too, not just rotates the head on a pivot. Split the rotation between the head bone (70%) and the neck bone (30%) for a more organic feel:

// Build partial rotations by slerping from identity toward the full
// head rotation: 30% of it drives the neck, 70% drives the head bone
const identity = new THREE.Quaternion();
neckBone.quaternion.slerpQuaternions(identity, quat, 0.3);
headBone.quaternion.slerpQuaternions(identity, quat, 0.7);

Latency + perf

Stage                                      Typical
FaceLandmarker inference (GPU)             ~10-20 ms/frame
FaceLandmarker inference (CPU)             ~30-50 ms/frame
Camera latency (webcam → video element)    ~50-100 ms
End-to-end (smile → avatar smiles)         ~80-150 ms

Fast enough for solo VTubing. For live performance with others watching, 150ms shows up as a slight lag but doesn't feel broken.

Hand tracking — same framework

MediaPipe also ships HandLandmarker. Same API. Drive character fingers from your hands. Combine with face for full upper-body capture. Add PoseLandmarker for full-body mocap from webcam.
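
HandLandmarker follows the same create-then-detect pattern. A minimal sketch; the model URL mirrors the face one, but double-check it against the current MediaPipe docs:

import { HandLandmarker } from
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision@latest/vision_bundle.mjs';

const handLandmarker = await HandLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://storage.googleapis.com/mediapipe-models/hand_landmarker/hand_landmarker/float16/1/hand_landmarker.task',
    delegate: 'GPU',
  },
  runningMode: 'VIDEO',
  numHands: 2,
});

// Inside the same detect loop:
const hands = handLandmarker.detectForVideo(video, performance.now());
// hands.landmarks → one array of 21 points per detected hand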

Gotchas in production

  • Lighting matters. Dim, backlit, or glare → noisy landmarks. Diffuse front-light is ideal.
  • Mirror the video. Most webcam UIs mirror — MediaPipe doesn't by default. Flip horizontally in the rendering layer, or mirror the blendshape names (swap *Left ↔ *Right).
  • Smoothing trade-off. Too little → twitchy. Too much → robotic lag. 0.3-0.5 lerp per frame is the sweet spot.
  • Blink detection is tricky. The model sometimes over-reports eyeBlink when you look down. Add a threshold (only trigger above ~0.6), ideally with hysteresis so it doesn't flutter; see the sketch after this list.
  • Browser compatibility. MediaPipe web needs a modern browser with WASM + getUserMedia. Works in Chrome, Edge, Safari 14+, Firefox latest.
  • Camera permission UX. Always behind a button click. Browsers reject auto-start.
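
For the blink threshold, a plain cutoff flickers when the raw score hovers near it. Hysteresis, where a higher score is needed to enter the blink than to leave it, keeps it stable. A sketch with assumed starting values:

let blinking = false;

function blinkValue(rawScore) {
  // Enter the blink above 0.6, leave it below 0.4; the gap stops flutter
  if (!blinking && rawScore > 0.6) blinking = true;
  else if (blinking && rawScore < 0.4) blinking = false;
  return blinking ? 1 : 0;
}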

Alternative: ONNX Runtime with a custom model

For fully custom face models (stylized shapes, artist-driven specifics), train your own and run via onnxruntime-web. More work but unlimited output flexibility.
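
The loading pattern is compact even if the training work isn't. A minimal sketch with onnxruntime-web; the model path, input name, tensor shape, and output name are hypothetical stand-ins for whatever your model defines:

import * as ort from 'onnxruntime-web';

// Hypothetical model: takes a preprocessed 1x3x192x192 face crop,
// returns a vector of custom blendshape weights
const session = await ort.InferenceSession.create('/models/face-shapes.onnx');
const pixelData = new Float32Array(3 * 192 * 192); // fill from your face crop
const input = new ort.Tensor('float32', pixelData, [1, 3, 192, 192]);
const outputs = await session.run({ image: input });
const weights = outputs.weights.data; // Float32Array of shape scores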

Common first-time pitfalls

  • Camera doesn't start — site isn't HTTPS. Localhost exempt.
  • Landmarks but no blendshapes — forgot outputFaceBlendshapes: true.
  • Character mirrors your face instead of matching — mirror left/right blendshape pairs.
  • Very low frame rate — ensure delegate: 'GPU'. CPU delegate is 3x slower.
  • Expression sticks after you relax — you're not clamping blendshapes back to 0 on no-detection frames. Decay every shape toward 0 each frame and let detected frames push them back up (sketch below).
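
The decay is a few lines: pull every influence toward 0 each frame, then let applyToCharacter push detected shapes back up. A sketch:

// Call once per frame, before applyToCharacter
function decayInfluences(mesh, factor = 0.85) {
  const influences = mesh.morphTargetInfluences;
  for (let i = 0; i < influences.length; i++) {
    influences[i] *= factor; // undetected shapes relax toward 0
  }
}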

Exercises

  1. Record + replay: capture 10 seconds of blendshape output to an array. Play it back driving the character later, without camera.
  2. Full upper-body: add HandLandmarker → drive character's hand bones. Add PoseLandmarker → drive shoulder + torso.
  3. Lip sync from audio + face: blend mic volume into jawOpen as a fallback when face detection is low confidence.

What's next

S3-10 — The Full Character Rig. Season 3 finale. Everything from S3 combined: skinned mesh + morphs + blend tree + IK + procedural + physics reactions + facial capture. One playable character, end-to-end.