Three.js From Zero · Season 10 · Article 06
MediaPipe Body + Face Tracking
Google's MediaPipe runs in-browser. Your webcam feeds landmark detection. Points drive Three.js bones. Avatar mirrors you in real time.
1. MediaPipe models
| Model | Landmarks | Use |
|---|---|---|
| Pose Landmarker | 33 body points | Full-body avatar |
| Hand Landmarker | 21 per hand | Finger-level VR interaction |
| Face Landmarker | 478 face points | Facial capture |
| Face Blendshapes (Face Landmarker output) | 52 ARKit-style scores | Drive morph targets |
| Holistic | Body + hands + face | Everything |
2. Setup
import { FilesetResolver, PoseLandmarker } from '@mediapipe/tasks-vision';
const vision = await FilesetResolver.forVisionTasks(
  'https://cdn.jsdelivr.net/npm/@mediapipe/tasks-vision/wasm'
);
const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: 'https://cdn.jsdelivr.net/.../pose_landmarker_lite.task' },
  runningMode: 'VIDEO',
  numPoses: 1,
});
// Per frame:
const result = pose.detectForVideo(videoEl, performance.now());
// result.landmarks[0] = array of 33 {x, y, z} points (normalized image coordinates, y down)
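The snippet above assumes a playing video element called videoEl. A minimal wiring sketch, assuming webcam permission is granted (the drivePose callback is a placeholder for whatever updates your avatar):
const videoEl = document.querySelector('video');
videoEl.srcObject = await navigator.mediaDevices.getUserMedia({ video: true });
await videoEl.play();
let lastVideoTime = -1;
function track() {
  // Only run detection when the webcam has delivered a new frame.
  if (videoEl.currentTime !== lastVideoTime) {
    lastVideoTime = videoEl.currentTime;
    const result = pose.detectForVideo(videoEl, performance.now());
    if (result.landmarks.length) drivePose(result.landmarks[0]);
  }
  requestAnimationFrame(track);
}
track();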
3. Landmark → bone rotation
For each bone (e.g., the upper arm), compute its rotation from the direction between two landmarks:
// Rotate a bone so its rest +Y axis points from landmark `from` toward `to`.
function pointAt(bone, from, to) {
  const dir = new THREE.Vector3().subVectors(to, from).normalize();
  const q = new THREE.Quaternion().setFromUnitVectors(new THREE.Vector3(0, 1, 0), dir);
  bone.quaternion.copy(q);
}
// MediaPipe Pose indices: 11 = left shoulder, 13 = left elbow, 15 = left wrist.
const LEFT_SHOULDER = 11, LEFT_ELBOW = 13, LEFT_WRIST = 15;
pointAt(leftUpperArm, landmarks[LEFT_SHOULDER], landmarks[LEFT_ELBOW]);
pointAt(leftForearm, landmarks[LEFT_ELBOW], landmarks[LEFT_WRIST]);
This is the naive version: it writes a world-space direction straight into the bone's local quaternion and assumes the bone's rest axis is +Y. Production needs rest-pose handling, IK cleanup, and smoothing.
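There is also a coordinate mismatch to handle: result.landmarks are normalized image coordinates with y pointing down, while result.worldLandmarks are metric coordinates with the origin near the hips. Either way, convert to THREE.Vector3 and flip y before calling pointAt(). A minimal sketch (the toVec3 helper is illustrative):
// Convert a MediaPipe landmark ({x, y, z}) into Three.js's y-up convention.
function toVec3(lm) {
  // Flip y (MediaPipe's y points down); whether z also needs flipping depends
  // on whether the avatar should mirror the webcam view.
  return new THREE.Vector3(lm.x, -lm.y, lm.z);
}
const world = result.worldLandmarks[0]; // meters, origin between the hips
pointAt(leftUpperArm, toVec3(world[LEFT_SHOULDER]), toVec3(world[LEFT_ELBOW]));
pointAt(leftForearm, toVec3(world[LEFT_ELBOW]), toVec3(world[LEFT_WRIST]));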
4. Face blendshapes → morph targets
import { FaceLandmarker } from '@mediapipe/tasks-vision';
const face = await FaceLandmarker.createFromOptions(vision, {
  baseOptions: { modelAssetPath: '...face_landmarker.task' },
  outputFaceBlendshapes: true,
  runningMode: 'VIDEO',
});
const result = face.detectForVideo(videoEl, performance.now());
if (result.faceBlendshapes.length) {
  for (const blend of result.faceBlendshapes[0].categories) {
    const idx = headMesh.morphTargetDictionary[blend.categoryName];
    if (idx !== undefined) headMesh.morphTargetInfluences[idx] = blend.score;
  }
}
MediaPipe's blendshape categories follow ARKit naming (jawOpen, eyeBlinkLeft, ...), so the scores plug straight into a Ready Player Me or MetaHuman-style avatar whose morph targets use the same names.
5. Smoothing
Raw landmarks jitter frame to frame. Smooth with a One-Euro filter or a simple low-pass:
const smoothed = prev.lerp(raw, 0.35); // per component; Vector3.lerp mutates prev in place
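For something reusable, a tiny exponential low-pass per landmark works; a One-Euro filter adds a speed-adaptive cutoff on top of the same idea. A minimal sketch (the class name and alpha value are illustrative):
// Exponential low-pass for one landmark. alpha near 1 = responsive, near 0 = smooth.
class LandmarkSmoother {
  constructor(alpha = 0.35) {
    this.alpha = alpha;
    this.value = null; // THREE.Vector3, set on the first sample
  }
  update(raw) {
    if (!this.value) this.value = raw.clone();
    else this.value.lerp(raw, this.alpha);
    return this.value;
  }
}
// One smoother per landmark index you care about:
const smoothers = Array.from({ length: 33 }, () => new LandmarkSmoother());
const smoothShoulder = smoothers[LEFT_SHOULDER].update(toVec3(world[LEFT_SHOULDER]));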
6. VRM integration
three-vrm = VRM avatar loader. Combine with MediaPipe → VTuber in 100 lines.
import { GLTFLoader } from 'three/addons/loaders/GLTFLoader.js';
import { VRMLoaderPlugin } from '@pixiv/three-vrm';
const loader = new GLTFLoader();
loader.register(parser => new VRMLoaderPlugin(parser));
const gltf = await loader.loadAsync('model.vrm');
const vrm = gltf.userData.vrm;
// Drive vrm.humanoid.getNormalizedBoneNode('leftUpperArm') etc. from MediaPipe landmarks
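A per-frame loop tying it together, as a sketch (assumes the pose landmarker, videoEl, toVec3() and pointAt() from the earlier sections, plus your usual renderer/scene/camera):
const clock = new THREE.Clock();
function tick() {
  const result = pose.detectForVideo(videoEl, performance.now());
  if (result.worldLandmarks.length) {
    const world = result.worldLandmarks[0];
    const upperArm = vrm.humanoid.getNormalizedBoneNode('leftUpperArm');
    if (upperArm) pointAt(upperArm, toVec3(world[LEFT_SHOULDER]), toVec3(world[LEFT_ELBOW]));
  }
  vrm.update(clock.getDelta()); // advances spring bones, expressions, look-at
  renderer.render(scene, camera);
  requestAnimationFrame(tick);
}
tick();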
7. Perf
- Pose lite model: 30+ fps on phone.
- Holistic (body + hands + face): 15-20 fps, heavier.
- GPU delegate: 2× speedup; a one-line change in baseOptions (see below).
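Enabling the GPU delegate, as a sketch mirroring the setup from section 2:
const pose = await PoseLandmarker.createFromOptions(vision, {
  baseOptions: {
    modelAssetPath: 'https://cdn.jsdelivr.net/.../pose_landmarker_lite.task',
    delegate: 'GPU', // run inference on the GPU delegate (CPU is the default)
  },
  runningMode: 'VIDEO',
  numPoses: 1,
});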
8. Use cases
- VTubing from webcam.
- Fitness apps with exercise recognition.
- Accessible controls (head-pose-driven UI).
- Sign language interpretation.
- Gesture shortcuts in XR.
9. Takeaways
- MediaPipe Tasks Vision = in-browser CV.
- Pose / hand / face landmarkers, each with named points.
- Face Blendshapes output matches ARKit — direct to morphs.
- Pose landmarks → bone rotations via vector math.
- Pair with three-vrm for VTuber pipelines.
- Smooth raw output with low-pass or One-Euro filter.
Concept article — a full MediaPipe-in-browser demo needs webcam permission and MediaPipe WASM. See S3-09 for a working facial capture demo.