Three.js From Zero · Article s11-05
MediaPipe Face/Pose + Three.js — Try-On at Home
Warby Parker, Ray-Ban, Persol, Oakley, Tom Ford. Every eyewear retailer over $30M ARR ships virtual try-on. The stack is MediaPipe Face Landmarker + Three.js with PD auto-correction, anchored to the bridge of the nose. The same architecture, with Pose Landmarker, anchors a sneaker to your foot.
1. Why try-on is the highest-conversion-lift configurator
Industry data from Warby Parker, Lenskart, and Smartzer: virtual try-on lifts conversion 30-40% on eyewear, reduces returns 15-25%, and roughly doubles average session duration. The reason is simple — buying glasses is a face-shape problem, and the customer is the only one with the face.
The equivalent for sneakers (Nike By You, On's foot tracker): 15-25% conversion lift, lower because a shoe is less personal to fit than a face, but still far more visual than a flat product page.
2. The full pipeline
Webcam (getUserMedia)
↓
HTMLVideoElement
↓
MediaPipe FaceLandmarker / PoseLandmarker (WASM)
↓
468 face landmarks (or 33 pose landmarks)
↓
Anchor math: derive position + orientation
↓
Three.js scene: glasses model parented to anchor
↓
Composited canvas (video + Three.js scene over)
The key insight: the video element is rendered as the background of a transparent Three.js canvas. The 3D model floats over the user's face in the same canvas. From the user's perspective it's one image — webcam + glasses.
3. Live demo — the illustrative version
Running the full MediaPipe pipeline inside an article demo is rough on shared mobile/desktop hardware (50 MB WASM, 30+ FPS face landmarker). So this demo uses a mouse-tracked anchor — the cursor simulates the bridge of your nose, and the glasses follow it. The scaling, orientation, and anchoring math is identical to the production pipeline; only the "where is the face" signal is faked.
For the production version, swap getMouseAnchor() for getMediaPipeBridgeAnchor(). The rest of the code is unchanged. We'll show that swap below.
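A minimal sketch of what the demo's mouse-driven anchor might look like (the function body and the rect parameter are our assumptions; only the name getMouseAnchor comes from the text above). The key idea is that it emits the same normalized 0-1 coordinate shape a MediaPipe landmark has, so the downstream anchor math never knows which source produced it:

```javascript
// Hypothetical stand-in for the face tracker: maps a cursor position to
// the same normalized 0-1 coordinate space MediaPipe landmarks use.
// rect is the demo canvas's bounding box ({ left, top, width, height }).
function getMouseAnchor(clientX, clientY, rect) {
  return {
    x: (clientX - rect.left) / rect.width, // 0 at left edge, 1 at right
    y: (clientY - rect.top) / rect.height, // 0 at top edge, 1 at bottom
    z: 0,                                  // no depth signal from a mouse
  };
}
```

Because the output shape matches a landmark ({x, y, z} in normalized image space), swapping in the real tracker later touches one call site, not the anchor math.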
4. Browser permissions — the shape of getUserMedia
Before MediaPipe ever runs, the browser asks for camera access. This is the pipeline's biggest drop-off point: about 35-45% of cold mobile visitors deny permission. Three best practices:
- Don't auto-trigger. Always require an explicit user click. Most browsers require a user gesture before showing the camera prompt, and some reject the promise outright without one.
- Pre-explain. A modal saying "we'll use your camera to overlay glasses, nothing leaves your device" lifts grant rate from 55% to 75%.
- Failure path. If denied, fall back to a still-photo upload or a 3D-only viewer. Never dead-end the user.
async function startCamera() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({
      video: { facingMode: 'user', width: 640, height: 480 },
      audio: false,
    });
    videoEl.srcObject = stream;
    await videoEl.play();
  } catch (err) {
    // NotAllowedError (denied), NotFoundError (no camera), etc.
    showFallbackUploader();
  }
}
5. MediaPipe FaceLandmarker — setup
The MediaPipe Tasks API ships as a WASM bundle. Total weight: ~50 MB for the face landmarker, ~14 MB for pose landmarker. Both are loaded lazily after camera permission is granted.
import { FaceLandmarker, FilesetResolver } from
  'https://cdn.jsdelivr.net/npm/@mediapipe/[email protected]';

const fileset = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/[email protected]/wasm'
);

const faceLandmarker = await FaceLandmarker.createFromOptions(fileset, {
  baseOptions: {
    modelAssetPath: '/models/face_landmarker.task',
    delegate: 'GPU', // critical — CPU is 2-3x slower
  },
  runningMode: 'VIDEO',
  numFaces: 1,
  outputFaceBlendshapes: false, // skip if you don't need expressions
  outputFacialTransformationMatrixes: true, // we WANT this — head pose
});
delegate: 'GPU' is the single most important option. CPU mode runs at 8-15 fps on a 2020 MacBook; GPU mode runs at 30-60 fps. The difference is night-and-day.
6. The 468 face landmarks — what's where
MediaPipe returns 468 3D landmarks per face. Each landmark is a normalized x/y/z coordinate (0-1 image space, z is relative depth). For glasses, we care about:
| Landmark indices | Anatomy | Use |
|---|---|---|
| 168, 6, 197, 195, 5, 4, 1 | Nose bridge / nose tip | Anchor point |
| 33, 133 | Left eye outer/inner corner | Width + PD measurement |
| 362, 263 | Right eye inner/outer corner | Width + PD measurement |
| 234, 454 | Left ear, right ear | Frame width sanity check |
| 10, 152 | Forehead, chin | Face height (for pose context) |
For glasses, the anchor is landmark 168 (the bridge of the nose, just below where eyebrows would meet). Landmarks 33 and 263 give us PD.
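Before a normalized landmark can drive a Three.js object it needs a small conversion: x is mirrored for the selfie view, y is flipped (image y grows downward, scene y grows upward), and both are scaled to the video plane's size in the scene. A sketch under those assumptions (the plane dimensions, mirroring convention, and function name are ours; the vec3() helper used later in this article would wrap a conversion like this in a THREE.Vector3):

```javascript
// Convert one normalized MediaPipe landmark ({x, y, z} in 0-1 image
// space) into scene coordinates on a video plane centered at the origin.
// Returns [x, y, z]; wrap in new THREE.Vector3(...) in the real pipeline.
function landmarkToScene(landmark, planeWidth = 2, planeHeight = 1.5) {
  return [
    (0.5 - landmark.x) * planeWidth,  // mirror x for the selfie view
    (0.5 - landmark.y) * planeHeight, // flip y: image-down to scene-up
    -landmark.z * planeWidth,         // relative depth, scaled like x
  ];
}
```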
7. PD — pupillary distance auto-correction
PD is the distance between the pupils, in millimeters. It varies from 54mm (children) to 74mm (large adult). Glasses fit relative to PD. If you scale a glasses model uniformly, it'll be too narrow for some users and too wide for others.
// Compute PD in pixels, then convert to mm via face width
const leftEye = landmarks[33];   // outer left
const rightEye = landmarks[263]; // outer right
const pdPx = Math.hypot(
  (leftEye.x - rightEye.x) * imgW,
  (leftEye.y - rightEye.y) * imgH
);

// Average PD : average inter-temple width = ~63mm : ~140mm
const faceWidthPx = Math.hypot(
  (landmarks[234].x - landmarks[454].x) * imgW,
  (landmarks[234].y - landmarks[454].y) * imgH
);
const mmPerPx = 140 / faceWidthPx; // approx face width 140mm
const pdMm = pdPx * mmPerPx;

glasses.scale.setScalar(pdMm / 63);
Scale uniformly by pdMm / 63 (where 63mm is the model's authored PD). This is the single most important detail — without it, frames look too small on big faces and too clown-y on small faces.
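Since the mm-per-px conversion is itself an approximation (not every face is 140mm temple-to-temple), it is worth clamping the measured PD to the plausible human range before scaling; the 54-74mm bounds come from the range quoted above. A sketch (the function name and clamp defaults are ours):

```javascript
// Clamp a measured PD to the plausible human range before deriving the
// uniform scale. 63mm is the glasses model's authored PD, as above.
const AUTHORED_PD_MM = 63;

function glassesScaleFromPd(pdMm, min = 54, max = 74) {
  const clamped = Math.min(max, Math.max(min, pdMm));
  return clamped / AUTHORED_PD_MM;
}
```

Then glasses.scale.setScalar(glassesScaleFromPd(pdMm)): a wildly off face-width estimate produces a slightly wrong frame size instead of an absurd one.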
8. Anchor math — bridge plane orientation
Position alone isn't enough. The glasses need to tilt with head yaw and pitch. The trick: form a coordinate frame from three landmarks and use it as the anchor's basis.
// Build a frame: nose-bridge as origin, eye-line as X axis, up from
// crown to chin as Y axis (negated)
const bridge = vec3(landmarks[168]);
const leftEye = vec3(landmarks[33]);
const rightEye = vec3(landmarks[263]);
const chin = vec3(landmarks[152]);

const xAxis = leftEye.clone().sub(rightEye).normalize();
const yAxisRaw = bridge.clone().sub(chin).normalize();
const zAxis = new THREE.Vector3().crossVectors(xAxis, yAxisRaw).normalize();
const yAxis = new THREE.Vector3().crossVectors(zAxis, xAxis).normalize();

const m = new THREE.Matrix4().makeBasis(xAxis, yAxis, zAxis);
glasses.quaternion.setFromRotationMatrix(m);
glasses.position.copy(bridge);
If you've got outputFacialTransformationMatrixes: true, MediaPipe gives you this matrix directly — a 4×4 matrix per face. Use it. Don't reinvent the basis math when MediaPipe ships it.
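For orientation, assuming the matrix arrives as a flat 16-element column-major array (the layout THREE.Matrix4 uses internally; verify this against your MediaPipe version before relying on it), the translation sits in elements 12-14. A sketch of that extraction:

```javascript
// Pull the translation out of a flat 4x4 transform, assuming
// column-major layout (elements 12-14 are tx, ty, tz). Check your
// MediaPipe version's matrix layout before trusting this assumption.
function translationFromMatrix(data) {
  return { x: data[12], y: data[13], z: data[14] };
}
```

In practice you would skip the manual extraction: with glasses.matrixAutoUpdate = false, a single glasses.matrix.fromArray(data) applies position and rotation together.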
9. Mobile front camera vs back camera
For face try-on, you want the front camera — facingMode: 'user'. For pose-based shoe try-on, you want the back camera (facingMode: 'environment') and the user has to point the camera at their feet.
| Try-on type | Camera | Tracker | Anchor |
|---|---|---|---|
| Glasses | front | FaceLandmarker | landmark 168 (bridge) |
| Earrings | front | FaceLandmarker | landmarks 234 / 454 (ears) |
| Hat | front | FaceLandmarker | landmark 10 + scale up |
| Watch | back | HandLandmarker | wrist landmark |
| Sneaker | back | PoseLandmarker | landmarks 27, 31 (left ankle / foot) |
10. The sneaker variant — same architecture, different landmarks
Pose Landmarker returns 33 body landmarks. For a single-foot sneaker try-on:
// Pose landmarks 27-32 are feet/ankles. Convert the raw landmark objects
// to THREE.Vector3 first (same vec3() helper as the glasses code).
const leftAnkle = vec3(poseLandmarks[27]);
const leftHeel = vec3(poseLandmarks[29]);
const leftToe = vec3(poseLandmarks[31]);

// Same basis math as glasses
const xAxis = leftToe.clone().sub(leftHeel).normalize();
// ... derive Y, Z axes and build the basis matrix m

sneaker.position.copy(leftAnkle);
sneaker.quaternion.setFromRotationMatrix(m);
The sneaker model must be authored to match: its local origin should sit at the ankle joint, with +X pointing along the toe direction.
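The elided "derive Y, Z axes" step is the same re-orthogonalization trick the glasses code uses: cross the known axis with a rough up vector, then cross back so all three axes are exactly perpendicular. A self-contained sketch with plain arrays (helper names and the default up vector are ours; production code would use THREE.Vector3):

```javascript
// Plain-array vector helpers (THREE.Vector3 equivalents).
const sub = (a, b) => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const cross = (a, b) => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0],
];
const norm = (v) => {
  const len = Math.hypot(v[0], v[1], v[2]);
  return [v[0] / len, v[1] / len, v[2] / len];
};

// Build an orthonormal foot frame: X along the foot, then Z and Y
// re-derived so the axes are exactly perpendicular even when the raw
// landmarks aren't.
function footBasis(heel, toe, roughUp = [0, 1, 0]) {
  const xAxis = norm(sub(toe, heel));
  const zAxis = norm(cross(xAxis, roughUp));
  const yAxis = norm(cross(zAxis, xAxis));
  return { xAxis, yAxis, zAxis };
}
```

Feed the three axes into Matrix4.makeBasis exactly as in the glasses snippet; axis sign conventions depend on how the sneaker model was authored.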
11. Lighting estimation hint
The glasses look fake if they're brightly lit and the user is in a dim room (or vice versa). Sample the average brightness of the video frame and bias the directional light intensity:
// Cheap exposure hint — average a 32x32 downscale of the video frame
const sampleCanvas = document.createElement('canvas');
sampleCanvas.width = sampleCanvas.height = 32;
const ctx = sampleCanvas.getContext('2d', { willReadFrequently: true });

function sampleVideoBrightness(video) {
  ctx.drawImage(video, 0, 0, 32, 32);
  const data = ctx.getImageData(0, 0, 32, 32).data;
  let sum = 0;
  for (let i = 0; i < data.length; i += 4) {
    sum += (data[i] + data[i + 1] + data[i + 2]) / 3;
  }
  return sum / (32 * 32 * 255); // 0-1 brightness
}

keyLight.intensity = 0.6 + sampleVideoBrightness(video) * 1.4;
12. Fallbacks
- No camera grant. Show a single still-image uploader; run face detection on that one frame; place glasses over the static photo. 60% of denied users will accept the photo flow.
- Face out of frame. Show a "look at the camera" hint overlay.
- Tracking jitter. Smooth landmarks with a one-Euro filter (linked in the references).
- Browser too old. WASM fallback with `delegate: 'CPU'` for older Safari versions.
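The one-Euro filter is the reference solution for jitter, but even a plain exponential moving average removes most of the visible shake. A minimal sketch (the 0.5 smoothing factor is a starting-point assumption to tune per device):

```javascript
// Exponential moving average over landmark positions. alpha = 1 means
// no smoothing (raw landmarks); lower alpha is smoother but laggier.
// A one-Euro filter adapts alpha to movement speed; this keeps it fixed.
function makeLandmarkSmoother(alpha = 0.5) {
  let prev = null;
  return function smooth(landmarks) {
    if (!prev) {
      prev = landmarks.map((p) => ({ ...p }));
      return prev;
    }
    prev = landmarks.map((p, i) => ({
      x: prev[i].x + alpha * (p.x - prev[i].x),
      y: prev[i].y + alpha * (p.y - prev[i].y),
      z: prev[i].z + alpha * (p.z - prev[i].z),
    }));
    return prev;
  };
}
```

Apply it once per video frame, before the anchor math, so position and orientation both benefit.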
13. Production performance numbers
| Device | FPS, MediaPipe + Three.js | Initial load |
|---|---|---|
| M1 MacBook Pro | 60 | 1.1s + 0.8s WASM |
| 2020 Intel MacBook | 30-45 | 1.5s + 1.4s WASM |
| iPhone 13 (Safari) | 30 | 2.0s + 1.5s WASM |
| Pixel 6 (Chrome) | 30 | 1.6s + 1.2s WASM |
| 2018 Android budget | 15-20 | 4s + 3s WASM |
The 50 MB MediaPipe bundle is the biggest bottleneck: serve it gzipped, set long-lived cache headers, and precache it in a service worker for repeat visits. Lazy-load it only after the permission grant.
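Lazy-loading after the grant is easy to get wrong if the user can retrigger startup: you want exactly one download no matter how many callers race. A memoized-loader sketch (the injected loadFn stands in for the FilesetResolver + createFromOptions sequence shown earlier; names are ours):

```javascript
// Kick off the heavy MediaPipe download exactly once, no matter how
// many callers race it. loadFn stands in for the FilesetResolver +
// FaceLandmarker.createFromOptions sequence shown earlier.
function makeLazyLoader(loadFn) {
  let promise = null;
  return function load() {
    if (!promise) promise = loadFn();
    return promise; // every caller shares the same in-flight promise
  };
}
```

Usage: const getLandmarker = makeLazyLoader(() => createFaceLandmarker()); call getLandmarker() as soon as the permission promise resolves.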
14. Takeaways
- Try-on lifts conversion 30-40% on eyewear. The pipeline isn't optional for retailers over $30M.
- MediaPipe FaceLandmarker → 468 landmarks; landmark 168 is your anchor for glasses.
- PD (landmark 33 to 263 distance) drives uniform scale. Without it, frames look wrong on every other face.
- Use outputFacialTransformationMatrixes: true — MediaPipe gives you the 4×4 directly. Don't reinvent the basis math.
- Always specify delegate: 'GPU'. CPU mode is unusable.
- Pose Landmarker → sneakers with the exact same architecture; just swap landmarks.
- Lazy-load the 50MB WASM model after the user grants camera permission.