Three.js From Zero · Article s10-06

MediaPipe Body + Face Tracking

Google's MediaPipe runs in-browser. Your webcam feeds landmark detection. Points drive Three.js bones. Avatar mirrors you in real time.

1. MediaPipe models

Model            | Landmarks            | Use
Pose Landmarker  | 33 body points       | Full-body avatar
Hand Landmarker  | 21 per hand          | Finger-level VR interaction
Face Landmarker  | 478 face points      | Facial capture
Face Blendshapes | 52 ARKit shapes      | Drive morph targets
Holistic         | Body + hands + face  | Everything

2. Setup

import { FilesetResolver, PoseLandmarker } from '@mediapipe/tasks-vision';

const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm'
);
const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: 'https://cdn.jsdelivr.net/.../pose_landmarker_lite.task' },
  runningMode: 'VIDEO',
  numPoses: 1,
});

// Per frame:
const result = pose.detectForVideo(videoEl, performance.now());
// result.landmarks[0] = array of 33 {x, y, z} points
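
Getting the webcam into a <video> element is standard getUserMedia. A minimal sketch; the requestVideoFrameCallback loop is one option (Chromium/Safari), requestAnimationFrame works everywhere:

// Wire the webcam into a <video>, then run detection once per video frame.
const videoEl = document.createElement('video');
videoEl.autoplay = true;
videoEl.muted = true;
videoEl.playsInline = true;
videoEl.srcObject = await navigator.mediaDevices.getUserMedia({ video: true });
await videoEl.play();

function tick() {
  const result = pose.detectForVideo(videoEl, performance.now());
  if (result.landmarks.length > 0) {
    // ...drive bones from result.landmarks[0] (next section)
  }
  videoEl.requestVideoFrameCallback(tick);
}
videoEl.requestVideoFrameCallback(tick);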

3. Landmark → bone rotation

For each bone (e.g., the upper arm), compute a rotation from the direction between two landmarks:

// Naive retarget: assumes the bone's rest direction is +Y in world space.
function pointAt(bone, from, to) {
  const dir = new THREE.Vector3().subVectors(to, from).normalize();
  const q = new THREE.Quaternion().setFromUnitVectors(new THREE.Vector3(0, 1, 0), dir);
  bone.quaternion.copy(q);
}
// Pose indices: 11 = left shoulder, 13 = left elbow, 15 = left wrist.
pointAt(leftUpperArm, landmarks[LEFT_SHOULDER], landmarks[LEFT_ELBOW]);
pointAt(leftForearm,  landmarks[LEFT_ELBOW],    landmarks[LEFT_WRIST]);

This is the naive version. Production needs IK cleanup + smoothing.
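
Also note the landmarks arrive as plain {x, y, z} objects in normalized image space (x/y in [0, 1], y pointing down), so convert before handing them to pointAt. A minimal sketch; SCALE and the mirroring convention are assumptions to tune per avatar:

// MediaPipe Pose landmark indices (from the model card).
const LEFT_SHOULDER = 11, LEFT_ELBOW = 13, LEFT_WRIST = 15;

const SCALE = 1.0;  // assumption: tune to your avatar's proportions
function toVec3(lm) {
  return new THREE.Vector3(
    (0.5 - lm.x) * SCALE,  // mirror x so the avatar mirrors you
    (0.5 - lm.y) * SCALE,  // flip y: MediaPipe y points down, Three.js y points up
    -lm.z * SCALE          // z is rough depth relative to the hips
  );
}

const lms = result.landmarks[0];
pointAt(leftUpperArm, toVec3(lms[LEFT_SHOULDER]), toVec3(lms[LEFT_ELBOW]));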

4. Face blendshapes → morph targets

import { FaceLandmarker } from '@mediapipe/tasks-vision';

const face = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: '...face_landmarker.task' },
  outputFaceBlendshapes: true,
  runningMode: 'VIDEO',
});
const result = face.detectForVideo(videoEl, performance.now());
for (const blend of result.faceBlendshapes[0].categories) {
  const idx = headMesh.morphTargetDictionary[blend.categoryName];
  if (idx !== undefined) headMesh.morphTargetInfluences[idx] = blend.score;
}

MediaPipe blendshapes use ARKit naming conventions, so the scores plug straight into a Ready Player Me or MetaHuman-style avatar.
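
Finding the morph-targeted head mesh on a loaded avatar, as a sketch; it assumes the avatar came through GLTFLoader (gltf as in section 6), and 'Wolf3D_Head' is Ready Player Me's naming, an assumption for anything else:

// Locate the mesh that owns the ARKit morph targets.
let headMesh = null;
gltf.scene.traverse((obj) => {
  if (obj.isMesh && obj.morphTargetDictionary) {
    // Ready Player Me names its head 'Wolf3D_Head'; other avatars differ.
    if (!headMesh || obj.name === 'Wolf3D_Head') headMesh = obj;
  }
});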

5. Smoothing

Raw landmarks jitter. One-Euro filter or low-pass:

const smoothed = prev.lerp(raw, 0.35);  // exponential low-pass; lerp mutates prev, per axis
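
A One-Euro filter adapts the cutoff to speed: smooth at rest, responsive in motion. A sketch for one scalar channel; minCutoff and beta are tuning assumptions:

// One-Euro filter (Casiez et al. 2012), one instance per landmark per axis.
// Lower minCutoff = smoother at rest; higher beta = less lag during fast motion.
class OneEuro {
  constructor(minCutoff = 1.0, beta = 0.02, dCutoff = 1.0) {
    this.minCutoff = minCutoff; this.beta = beta; this.dCutoff = dCutoff;
    this.xPrev = null; this.dxPrev = 0; this.tPrev = null;
  }
  alpha(cutoff, dt) {
    const tau = 1 / (2 * Math.PI * cutoff);
    return 1 / (1 + tau / dt);
  }
  filter(x, tMs) {
    if (this.xPrev === null) { this.xPrev = x; this.tPrev = tMs; return x; }
    const dt = Math.max((tMs - this.tPrev) / 1000, 1e-6);
    this.tPrev = tMs;
    // Smooth the derivative, then use it to pick the adaptive cutoff.
    const dx = (x - this.xPrev) / dt;
    const aD = this.alpha(this.dCutoff, dt);
    this.dxPrev = aD * dx + (1 - aD) * this.dxPrev;
    const cutoff = this.minCutoff + this.beta * Math.abs(this.dxPrev);
    const a = this.alpha(cutoff, dt);
    this.xPrev = a * x + (1 - a) * this.xPrev;
    return this.xPrev;
  }
}
// Usage: filters[i].x.filter(lm.x, performance.now()), one filter per channel.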

6. VRM integration

three-vrm = VRM avatar loader. Combine with MediaPipe → VTuber in 100 lines.

import { GLTFLoader } from 'three/addons/loaders/GLTFLoader.js';
import { VRMLoaderPlugin } from '@pixiv/three-vrm';

const loader = new GLTFLoader();
loader.register(p => new VRMLoaderPlugin(p));
const gltf = await loader.loadAsync('model.vrm');
const vrm = gltf.userData.vrm;
// Drive vrm.humanoid.getNormalizedBoneNode('leftUpperArm') from MediaPipe,
// then call vrm.update(delta) once per frame.
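
Wiring the two together each frame, as a sketch; it reuses the pointAt/toVec3 helpers from section 3 and assumes the pose landmarker and videoEl from section 2:

const leftUpperArm = vrm.humanoid.getNormalizedBoneNode('leftUpperArm');
const leftLowerArm = vrm.humanoid.getNormalizedBoneNode('leftLowerArm');

function updateAvatar(delta) {
  const result = pose.detectForVideo(videoEl, performance.now());
  const lms = result.landmarks[0];
  if (lms) {
    pointAt(leftUpperArm, toVec3(lms[11]), toVec3(lms[13]));  // shoulder -> elbow
    pointAt(leftLowerArm, toVec3(lms[13]), toVec3(lms[15]));  // elbow -> wrist
  }
  vrm.update(delta);  // required: applies normalized bones, spring bones, expressions
}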

7. Perf

  • Pose lite model: 30+ fps on a phone.
  • Holistic (body + hands + face): 15-20 fps, heavier.
  • GPU delegate: 2× speedup (snippet below).
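
Enabling the GPU delegate is a single baseOptions flag; a sketch, same lite pose model as section 2:

// delegate: 'GPU' selects the WebGL path where supported; the default is CPU/WASM.
const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://cdn.jsdelivr.net/.../pose_landmarker_lite.task',
    delegate: 'GPU',
  },
  runningMode: 'VIDEO',
  numPoses: 1,
});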

8. Use cases

  • VTubing from webcam.
  • Fitness apps with exercise recognition.
  • Accessible controls (head-pose-driven UI).
  • Sign language interpretation.
  • Gesture shortcuts in XR.

9. Takeaways

  • MediaPipe Tasks Vision = in-browser CV.
  • Pose / hand / face landmarkers, each with named points.
  • Face Blendshapes output matches ARKit — direct to morphs.
  • Pose landmarks → bone rotations via vector math.
  • Pair with three-vrm for VTuber pipelines.
  • Smooth raw output with low-pass or One-Euro filter.

Concept article — a full MediaPipe-in-browser demo needs webcam permission and MediaPipe WASM. See S3-09 for a working facial capture demo.