Three.js From Zero · Article s11-05
MediaPipe Face/Pose + Three.js — Try-On at Home
Warby Parker, Ray-Ban, Persol, Oakley, Tom Ford. Every eyewear retailer over $30M ARR ships virtual try-on. The stack is MediaPipe Face Landmarker + Three.js with PD auto-correction, anchored to the bridge of the nose. The same architecture, with Pose Landmarker, anchors a sneaker to your foot.
1. Why try-on is the highest-conversion-lift configurator
Industry data from Warby Parker, Lenskart, and Smartzer: virtual try-on lifts conversion 30-40% on eyewear, reduces returns 15-25%, and roughly doubles average session duration. The reason is simple — buying glasses is a face-shape problem, and the customer is the only one with the face.
The equivalent for sneakers (Nike By You, On's foot tracker): 15-25% conversion lift, lower because a shoe is less personal to fit than a face, but still far more visual than a flat product page.
2. The full pipeline
Webcam (getUserMedia)
↓
HTMLVideoElement
↓
MediaPipe FaceLandmarker / PoseLandmarker (WASM)
↓
468 face landmarks (or 33 pose landmarks)
↓
Anchor math: derive position + orientation
↓
Three.js scene: glasses model parented to anchor
↓
Composited canvas (video + Three.js scene over)
The key insight: the video element is rendered as the background of a transparent Three.js canvas. The 3D model floats over the user's face in the same canvas. From the user's perspective it's one image — webcam + glasses.
3. Live demo — the illustrative version
Running the full MediaPipe pipeline inside an article demo is rough on shared mobile/desktop hardware (50 MB WASM, 30+ FPS face landmarker). So this demo uses a mouse-tracked anchor — the cursor simulates the bridge of your nose, and the glasses follow it. The scaling, orientation, and anchoring math is identical to the production pipeline; only the "where is the face" signal is faked.
For the production version, swap getMouseAnchor() for getMediaPipeBridgeAnchor(). The rest of the code is unchanged. We'll show that swap below.
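A minimal sketch of what the demo's mouse-driven anchor might look like (the function body and the rect parameter are our assumptions; only the name getMouseAnchor comes from the text above). The key idea is that it emits the same normalized 0-1 coordinate shape a MediaPipe landmark has, so the downstream anchor math never knows which source produced it:

```javascript
// Hypothetical stand-in for the face tracker: maps a cursor position to
// the same normalized 0-1 coordinate space MediaPipe landmarks use.
// rect is the demo canvas's bounding box ({ left, top, width, height }).
function getMouseAnchor(clientX, clientY, rect) {
  return {
    x: (clientX - rect.left) / rect.width, // 0 at left edge, 1 at right
    y: (clientY - rect.top) / rect.height, // 0 at top edge, 1 at bottom
    z: 0,                                  // no depth signal from a mouse
  };
}
```

Because the output shape matches a landmark ({x, y, z} in normalized image space), swapping in the real tracker later touches one call site, not the anchor math.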
4. Browser permissions — the shape of getUserMedia
Before MediaPipe ever runs, the browser asks for camera access. This is the pipeline's biggest drop-off point: about 35-45% of cold mobile visitors deny permission. Three best practices:
- Don't auto-trigger. Always require an explicit user click. Most browsers require a user gesture before showing the camera prompt, and some reject the promise outright without one.
- Pre-explain. A modal saying "we'll use your camera to overlay glasses, nothing leaves your device" lifts grant rate from 55% to 75%.
- Failure path. If denied, fall back to a still-photo upload or a 3D-only viewer. Never dead-end the user.
async function startCamera() {
  try {
    const stream = await navigator.mediaDevices.getUserMedia({
      video: { facingMode: 'user', width: 640, height: 480 },
      audio: false,
    });
    videoEl.srcObject = stream;
    await videoEl.play();
  } catch (err) {
    // NotAllowedError (denied), NotFoundError (no camera), etc.
    showFallbackUploader();
  }
}
5. MediaPipe FaceLandmarker — setup
The MediaPipe Tasks API ships as a WASM bundle. Total weight: ~50 MB for the face landmarker, ~14 MB for pose landmarker. Both are loaded lazily after camera permission is granted.
import { FaceLandmarker, FilesetResolver } from
  'https://cdn.jsdelivr.net/npm/@mediapipe/[email protected]';

const fileset = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/[email protected]/wasm'
);

const faceLandmarker = await FaceLandmarker.createFromOptions(fileset, {
  baseOptions: {
    modelAssetPath: '/models/face_landmarker.task',
    delegate: 'GPU', // critical — CPU is 2-3x slower
  },
  runningMode: 'VIDEO',
  numFaces: 1,
  outputFaceBlendshapes: false, // skip if you don't need expressions
  outputFacialTransformationMatrixes: true, // we WANT this — head pose
});
delegate: 'GPU' is the single most important option. CPU mode runs at 8-15 fps on a 2020 MacBook; GPU mode runs at 30-60 fps. The difference is night-and-day.
6. The 468 face landmarks — what's where
MediaPipe returns 468 3D landmarks per face. Each landmark is a normalized x/y/z coordinate (0-1 image space, z is relative depth). For glasses, we care about:
| Landmark indices | Anatomy | Use |
|---|---|---|
| 168, 6, 197, 195, 5, 4, 1 | Nose bridge / nose tip | Anchor point |
| 33, 133 | Left eye outer/inner corner | Width + PD measurement |
| 362, 263 | Right eye inner/outer corner | Width + PD measurement |
| 234, 454 | Left ear, right ear | Frame width sanity check |
| 10, 152 | Forehead, chin | Face height (for pose context) |
For glasses, the anchor is landmark 168 (the bridge of the nose, just below where eyebrows would meet). Landmarks 33 and 263 give us PD.
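Before a normalized landmark can drive a Three.js object it needs a small conversion: x is mirrored for the selfie view, y is flipped (image y grows downward, scene y grows upward), and both are scaled to the video plane's size in the scene. A sketch under those assumptions (the plane dimensions, mirroring convention, and function name are ours; the vec3() helper used later in this article would wrap a conversion like this in a THREE.Vector3):

```javascript
// Convert one normalized MediaPipe landmark ({x, y, z} in 0-1 image
// space) into scene coordinates on a video plane centered at the origin.
// Returns [x, y, z]; wrap in new THREE.Vector3(...) in the real pipeline.
function landmarkToScene(landmark, planeWidth = 2, planeHeight = 1.5) {
  return [
    (0.5 - landmark.x) * planeWidth,  // mirror x for the selfie view
    (0.5 - landmark.y) * planeHeight, // flip y: image-down to scene-up
    -landmark.z * planeWidth,         // relative depth, scaled like x
  ];
}
```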
7. PD — pupillary distance auto-correction
PD is the distance between the pupils, in millimeters. It varies from 54mm (children) to 74mm (large adult). Glasses fit relative to PD. If you scale a glasses model uniformly, it'll be too narrow for some users and too wide for others.
// Compute PD in pixels, then convert to mm via face width
const leftEye = landmarks[33];   // outer left
const rightEye = landmarks[263]; // outer right
const pdPx = Math.hypot(
  (leftEye.x - rightEye.x) * imgW,
  (leftEye.y - rightEye.y) * imgH
);

// Average PD : average inter-temple width = ~63mm : ~140mm
const faceWidthPx = Math.hypot(
  (landmarks[234].x - landmarks[454].x) * imgW,
  (landmarks[234].y - landmarks[454].y) * imgH
);
const mmPerPx = 140 / faceWidthPx; // approx face width 140mm
const pdMm = pdPx * mmPerPx;

glasses.scale.setScalar(pdMm / 63);
Scale uniformly by pdMm / 63 (where 63mm is the model's authored PD). This is the single most important detail — without it, frames look too small on big faces and too clown-y on small faces.
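Since the mm-per-px conversion is itself an approximation (not every face is 140mm temple-to-temple), it is worth clamping the measured PD to the plausible human range before scaling; the 54-74mm bounds come from the range quoted above. A sketch (the function name and clamp defaults are ours):

```javascript
// Clamp a measured PD to the plausible human range before deriving the
// uniform scale. 63mm is the glasses model's authored PD, as above.
const AUTHORED_PD_MM = 63;

function glassesScaleFromPd(pdMm, min = 54, max = 74) {
  const clamped = Math.min(max, Math.max(min, pdMm));
  return clamped / AUTHORED_PD_MM;
}
```

Then glasses.scale.setScalar(glassesScaleFromPd(pdMm)): a wildly off face-width estimate produces a slightly wrong frame size instead of an absurd one.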
8. Anchor math — bridge plane orientation
Position alone isn't enough. The glasses need to tilt with head yaw and pitch. The trick: form a coordinate frame from three landmarks and use it as the anchor's basis.
// Build a frame: nose-bridge as origin, eye-line as X axis, up from
// crown to chin as Y axis (negated)
const bridge = vec3(landmarks[168]);
const leftEye = vec3(landmarks[33]);
const rightEye = vec3(landmarks[263]);
const chin = vec3(landmarks[152]);

const xAxis = leftEye.clone().sub(rightEye).normalize();
const yAxisRaw = bridge.clone().sub(chin).normalize();
const zAxis = new THREE.Vector3().crossVectors(xAxis, yAxisRaw).normalize();
const yAxis = new THREE.Vector3().crossVectors(zAxis, xAxis).normalize();

const m = new THREE.Matrix4().makeBasis(xAxis, yAxis, zAxis);
glasses.quaternion.setFromRotationMatrix(m);
glasses.position.copy(bridge);
If you've got outputFacialTransformationMatrixes: true, MediaPipe gives you this matrix directly — a 4×4 matrix per face. Use it. Don't reinvent the basis math when MediaPipe ships it.
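For orientation, assuming the matrix arrives as a flat 16-element column-major array (the layout THREE.Matrix4 uses internally; verify this against your MediaPipe version before relying on it), the translation sits in elements 12-14. A sketch of that extraction:

```javascript
// Pull the translation out of a flat 4x4 transform, assuming
// column-major layout (elements 12-14 are tx, ty, tz). Check your
// MediaPipe version's matrix layout before trusting this assumption.
function translationFromMatrix(data) {
  return { x: data[12], y: data[13], z: data[14] };
}
```

In practice you would skip the manual extraction: with glasses.matrixAutoUpdate = false, a single glasses.matrix.fromArray(data) applies position and rotation together.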
9. Mobile front camera vs back camera
For face try-on, you want the front camera — facingMode: 'user'. For pose-based shoe try-on, you want the back camera (facingMode: 'environment') and the user has to point the camera at their feet.
| Try-on type | Camera | Tracker | Anchor |
|---|---|---|---|
| Glasses | front | FaceLandmarker | landmark 168 (bridge) |
| Earrings | front | FaceLandmarker | landmarks 234 / 454 (ears) |
| Hat | front | FaceLandmarker | landmark 10 + scale up |
| Watch | back | HandLandmarker | wrist landmark |
| Sneaker | back | PoseLandmarker | landmarks 27, 31 (left ankle / foot) |
10. The sneaker variant — same architecture, different landmarks
Pose Landmarker returns 33 body landmarks. For a single-foot sneaker try-on:
// Pose landmarks 27-32 are feet/ankles. Convert the raw landmark objects
// to THREE.Vector3 first (same vec3() helper as the glasses code).
const leftAnkle = vec3(poseLandmarks[27]);
const leftHeel = vec3(poseLandmarks[29]);
const leftToe = vec3(poseLandmarks[31]);

// Same basis math as glasses
const xAxis = leftToe.clone().sub(leftHeel).normalize();
// ... derive Y, Z axes and build the basis matrix m

sneaker.position.copy(leftAnkle);
sneaker.quaternion.setFromRotationMatrix(m);
The sneaker model must be authored to match: its local origin should sit at the ankle joint, with +X pointing along the toe direction.
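The elided "derive Y, Z axes" step is the same re-orthogonalization trick the glasses code uses: cross the known axis with a rough up vector, then cross back so all three axes are exactly perpendicular. A self-contained sketch with plain arrays (helper names and the default up vector are ours; production code would use THREE.Vector3):

```javascript
// Plain-array vector helpers (THREE.Vector3 equivalents).
const sub = (a, b) => [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
const cross = (a, b) => [
  a[1] * b[2] - a[2] * b[1],
  a[2] * b[0] - a[0] * b[2],
  a[0] * b[1] - a[1] * b[0],
];
const norm = (v) => {
  const len = Math.hypot(v[0], v[1], v[2]);
  return [v[0] / len, v[1] / len, v[2] / len];
};

// Build an orthonormal foot frame: X along the foot, then Z and Y
// re-derived so the axes are exactly perpendicular even when the raw
// landmarks aren't.
function footBasis(heel, toe, roughUp = [0, 1, 0]) {
  const xAxis = norm(sub(toe, heel));
  const zAxis = norm(cross(xAxis, roughUp));
  const yAxis = norm(cross(zAxis, xAxis));
  return { xAxis, yAxis, zAxis };
}
```

Feed the three axes into Matrix4.makeBasis exactly as in the glasses snippet; axis sign conventions depend on how the sneaker model was authored.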
11. Lighting estimation hint
The glasses look fake if they're brightly lit and the user is in a dim room (or vice versa). Sample the average brightness of the video frame and bias the directional light intensity:
// Cheap exposure hint — average a 32x32 downscale of the video frame
const sampleCanvas = document.createElement('canvas');
sampleCanvas.width = sampleCanvas.height = 32;
const ctx = sampleCanvas.getContext('2d', { willReadFrequently: true });

function sampleVideoBrightness(video) {
  ctx.drawImage(video, 0, 0, 32, 32);
  const data = ctx.getImageData(0, 0, 32, 32).data;
  let sum = 0;
  for (let i = 0; i < data.length; i += 4) {
    sum += (data[i] + data[i + 1] + data[i + 2]) / 3;
  }
  return sum / (32 * 32 * 255); // 0-1 brightness
}

keyLight.intensity = 0.6 + sampleVideoBrightness(video) * 1.4;
12. Fallbacks
- No camera grant. Show a single still-image uploader; run face detection on that one frame; place glasses over the static photo. 60% of denied users will accept the photo flow.
- Face out of frame. Show a "look at the camera" hint overlay.
- Tracking jitter. Smooth landmarks with a one-Euro filter (linked in the references).
- Browser too old. WASM fallback with `delegate: 'CPU'` for older Safari versions.
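The one-Euro filter is the reference solution for jitter, but even a plain exponential moving average removes most of the visible shake. A minimal sketch (the 0.5 smoothing factor is a starting-point assumption to tune per device):

```javascript
// Exponential moving average over landmark positions. alpha = 1 means
// no smoothing (raw landmarks); lower alpha is smoother but laggier.
// A one-Euro filter adapts alpha to movement speed; this keeps it fixed.
function makeLandmarkSmoother(alpha = 0.5) {
  let prev = null;
  return function smooth(landmarks) {
    if (!prev) {
      prev = landmarks.map((p) => ({ ...p }));
      return prev;
    }
    prev = landmarks.map((p, i) => ({
      x: prev[i].x + alpha * (p.x - prev[i].x),
      y: prev[i].y + alpha * (p.y - prev[i].y),
      z: prev[i].z + alpha * (p.z - prev[i].z),
    }));
    return prev;
  };
}
```

Apply it once per video frame, before the anchor math, so position and orientation both benefit.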
13. Production performance numbers
| Device | FPS, MediaPipe + Three.js | Initial load |
|---|---|---|
| M1 MacBook Pro | 60 | 1.1s + 0.8s WASM |
| 2020 Intel MacBook | 30-45 | 1.5s + 1.4s WASM |
| iPhone 13 (Safari) | 30 | 2.0s + 1.5s WASM |
| Pixel 6 (Chrome) | 30 | 1.6s + 1.2s WASM |
| 2018 Android budget | 15-20 | 4s + 3s WASM |
The 50 MB MediaPipe bundle is the biggest bottleneck: serve it gzipped, set long-lived cache headers, and precache it in a service worker for repeat visits. Lazy-load it only after the permission grant.
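Lazy-loading after the grant is easy to get wrong if the user can retrigger startup: you want exactly one download no matter how many callers race. A memoized-loader sketch (the injected loadFn stands in for the FilesetResolver + createFromOptions sequence shown earlier; names are ours):

```javascript
// Kick off the heavy MediaPipe download exactly once, no matter how
// many callers race it. loadFn stands in for the FilesetResolver +
// FaceLandmarker.createFromOptions sequence shown earlier.
function makeLazyLoader(loadFn) {
  let promise = null;
  return function load() {
    if (!promise) promise = loadFn();
    return promise; // every caller shares the same in-flight promise
  };
}
```

Usage: const getLandmarker = makeLazyLoader(() => createFaceLandmarker()); call getLandmarker() as soon as the permission promise resolves.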
14. Takeaways
- Try-on lifts conversion 30-40% on eyewear. The pipeline isn't optional for retailers over $30M.
- MediaPipe FaceLandmarker → 468 landmarks; landmark 168 is your anchor for glasses.
- PD (landmark 33 to 263 distance) drives uniform scale. Without it, frames look wrong on every other face.
- Use outputFacialTransformationMatrixes: true — MediaPipe gives you the 4×4 directly. Don't reinvent the basis math.
- Always specify delegate: 'GPU'. CPU mode is unusable.
- Pose Landmarker → sneakers with the exact same architecture; just swap landmarks.
- Lazy-load the 50MB WASM model after the user grants camera permission.