Three.js From Zero · Article s10-06

MediaPipe Body + Face Tracking

Google's MediaPipe runs in-browser. Your webcam feeds landmark detection. Points drive Three.js bones. Avatar mirrors you in real time.

1. MediaPipe models

Model            | Landmarks            | Use
Pose Landmarker  | 33 body points       | Full-body avatar
Hand Landmarker  | 21 per hand          | Finger-level VR interaction
Face Landmarker  | 478 face points      | Facial capture
Face Blendshapes | 52 ARKit shapes      | Drive morph targets
Holistic         | Body + hands + face  | Everything

2. Setup

import { FilesetResolver, PoseLandmarker } from '@mediapipe/tasks-vision';

const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm'
);
const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: 'https://cdn.jsdelivr.net/.../pose_landmarker_lite.task' },
  runningMode: 'VIDEO',
  numPoses: 1,
});

// Per frame:
const result = pose.detectForVideo(videoEl, performance.now());
// result.landmarks[0] = array of 33 {x, y, z} points
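
Getting the webcam into a <video> element is standard getUserMedia. A minimal sketch; the requestVideoFrameCallback loop is one option (Chromium/Safari), requestAnimationFrame works everywhere:

// Wire the webcam into a <video>, then run detection once per video frame.
const videoEl = document.createElement('video');
videoEl.autoplay = true;
videoEl.muted = true;
videoEl.playsInline = true;
videoEl.srcObject = await navigator.mediaDevices.getUserMedia({ video: true });
await videoEl.play();

function tick() {
  const result = pose.detectForVideo(videoEl, performance.now());
  if (result.landmarks.length > 0) {
    // ...drive bones from result.landmarks[0] (next section)
  }
  videoEl.requestVideoFrameCallback(tick);
}
videoEl.requestVideoFrameCallback(tick);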

3. Landmark → bone rotation

For each bone (e.g., the upper arm), compute a rotation from the direction between two landmarks:

// Naive retarget: assumes the bone's rest direction is +Y in world space.
function pointAt(bone, from, to) {
  const dir = new THREE.Vector3().subVectors(to, from).normalize();
  const q = new THREE.Quaternion().setFromUnitVectors(new THREE.Vector3(0, 1, 0), dir);
  bone.quaternion.copy(q);
}
// Pose indices: 11 = left shoulder, 13 = left elbow, 15 = left wrist.
pointAt(leftUpperArm, landmarks[LEFT_SHOULDER], landmarks[LEFT_ELBOW]);
pointAt(leftForearm,  landmarks[LEFT_ELBOW],    landmarks[LEFT_WRIST]);

This is the naive version. Production needs IK cleanup + smoothing.
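
Also note the landmarks arrive as plain {x, y, z} objects in normalized image space (x/y in [0, 1], y pointing down), so convert before handing them to pointAt. A minimal sketch; SCALE and the mirroring convention are assumptions to tune per avatar:

// MediaPipe Pose landmark indices (from the model card).
const LEFT_SHOULDER = 11, LEFT_ELBOW = 13, LEFT_WRIST = 15;

const SCALE = 1.0;  // assumption: tune to your avatar's proportions
function toVec3(lm) {
  return new THREE.Vector3(
    (0.5 - lm.x) * SCALE,  // mirror x so the avatar mirrors you
    (0.5 - lm.y) * SCALE,  // flip y: MediaPipe y points down, Three.js y points up
    -lm.z * SCALE          // z is rough depth relative to the hips
  );
}

const lms = result.landmarks[0];
pointAt(leftUpperArm, toVec3(lms[LEFT_SHOULDER]), toVec3(lms[LEFT_ELBOW]));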

4. Face blendshapes → morph targets

import { FaceLandmarker } from '@mediapipe/tasks-vision';

const face = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: '...face_landmarker.task' },
  outputFaceBlendshapes: true,
  runningMode: 'VIDEO',
});
const result = face.detectForVideo(videoEl, performance.now());
for (const blend of result.faceBlendshapes[0].categories) {
  const idx = headMesh.morphTargetDictionary[blend.categoryName];
  if (idx !== undefined) headMesh.morphTargetInfluences[idx] = blend.score;
}

MediaPipe blendshapes use ARKit naming conventions, so the scores plug straight into a Ready Player Me or MetaHuman-style avatar.
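
Finding the morph-targeted head mesh on a loaded avatar, as a sketch; it assumes the avatar came through GLTFLoader (gltf as in section 6), and 'Wolf3D_Head' is Ready Player Me's naming, an assumption for anything else:

// Locate the mesh that owns the ARKit morph targets.
let headMesh = null;
gltf.scene.traverse((obj) => {
  if (obj.isMesh && obj.morphTargetDictionary) {
    // Ready Player Me names its head 'Wolf3D_Head'; other avatars differ.
    if (!headMesh || obj.name === 'Wolf3D_Head') headMesh = obj;
  }
});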

5. Smoothing

Raw landmarks jitter. One-Euro filter or low-pass:

const smoothed = prev.lerp(raw, 0.35);  // exponential low-pass; lerp mutates prev, per axis
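
A One-Euro filter adapts the cutoff to speed: smooth at rest, responsive in motion. A sketch for one scalar channel; minCutoff and beta are tuning assumptions:

// One-Euro filter (Casiez et al. 2012), one instance per landmark per axis.
// Lower minCutoff = smoother at rest; higher beta = less lag during fast motion.
class OneEuro {
  constructor(minCutoff = 1.0, beta = 0.02, dCutoff = 1.0) {
    this.minCutoff = minCutoff; this.beta = beta; this.dCutoff = dCutoff;
    this.xPrev = null; this.dxPrev = 0; this.tPrev = null;
  }
  alpha(cutoff, dt) {
    const tau = 1 / (2 * Math.PI * cutoff);
    return 1 / (1 + tau / dt);
  }
  filter(x, tMs) {
    if (this.xPrev === null) { this.xPrev = x; this.tPrev = tMs; return x; }
    const dt = Math.max((tMs - this.tPrev) / 1000, 1e-6);
    this.tPrev = tMs;
    // Smooth the derivative, then use it to pick the adaptive cutoff.
    const dx = (x - this.xPrev) / dt;
    const aD = this.alpha(this.dCutoff, dt);
    this.dxPrev = aD * dx + (1 - aD) * this.dxPrev;
    const cutoff = this.minCutoff + this.beta * Math.abs(this.dxPrev);
    const a = this.alpha(cutoff, dt);
    this.xPrev = a * x + (1 - a) * this.xPrev;
    return this.xPrev;
  }
}
// Usage: filters[i].x.filter(lm.x, performance.now()), one filter per channel.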

6. VRM integration

three-vrm = VRM avatar loader. Combine with MediaPipe → VTuber in 100 lines.

import { GLTFLoader } from 'three/addons/loaders/GLTFLoader.js';
import { VRMLoaderPlugin } from '@pixiv/three-vrm';

const loader = new GLTFLoader();
loader.register(p => new VRMLoaderPlugin(p));
const gltf = await loader.loadAsync('model.vrm');
const vrm = gltf.userData.vrm;
// Drive vrm.humanoid.getNormalizedBoneNode('leftUpperArm') from MediaPipe,
// then call vrm.update(delta) once per frame.
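
Wiring the two together each frame, as a sketch; it reuses the pointAt/toVec3 helpers from section 3 and assumes the pose landmarker and videoEl from section 2:

const leftUpperArm = vrm.humanoid.getNormalizedBoneNode('leftUpperArm');
const leftLowerArm = vrm.humanoid.getNormalizedBoneNode('leftLowerArm');

function updateAvatar(delta) {
  const result = pose.detectForVideo(videoEl, performance.now());
  const lms = result.landmarks[0];
  if (lms) {
    pointAt(leftUpperArm, toVec3(lms[11]), toVec3(lms[13]));  // shoulder -> elbow
    pointAt(leftLowerArm, toVec3(lms[13]), toVec3(lms[15]));  // elbow -> wrist
  }
  vrm.update(delta);  // required: applies normalized bones, spring bones, expressions
}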

7. Perf

  • Pose lite model: 30+ fps on a phone.
  • Holistic (body + hands + face): 15-20 fps, heavier.
  • GPU delegate: 2× speedup (snippet below).
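
Enabling the GPU delegate is a single baseOptions flag; a sketch, same lite pose model as section 2:

// delegate: 'GPU' selects the WebGL path where supported; the default is CPU/WASM.
const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://cdn.jsdelivr.net/.../pose_landmarker_lite.task',
    delegate: 'GPU',
  },
  runningMode: 'VIDEO',
  numPoses: 1,
});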

8. Use cases

  • VTubing from webcam.
  • Fitness apps with exercise recognition.
  • Accessible controls (head-pose-driven UI).
  • Sign language interpretation.
  • Gesture shortcuts in XR.

9. Takeaways

  • MediaPipe Tasks Vision = in-browser CV.
  • Pose / hand / face landmarkers, each with named points.
  • Face Blendshapes output matches ARKit — direct to morphs.
  • Pose landmarks → bone rotations via vector math.
  • Pair with three-vrm for VTuber pipelines.
  • Smooth raw output with low-pass or One-Euro filter.

Concept article — a full MediaPipe-in-browser demo needs webcam permission and MediaPipe WASM. See S3-09 for a working facial capture demo.