Three.js From Zero · Article s10-05

S10-05 TTS + Lip Sync

The browser speaks your text; the character's mouth moves with it. SpeechSynthesis API + amplitude analysis + morph-target driving. Talking head in ~30 lines.

1. Browser speech synthesis

const utter = new SpeechSynthesisUtterance("Hello world");
utter.rate = 1.0;   // 0.1–10
utter.pitch = 1.0;  // 0–2
// getVoices() can return an empty array until 'voiceschanged' has fired.
const voice = speechSynthesis.getVoices().find(v => v.lang.startsWith('en'));
if (voice) utter.voice = voice;
speechSynthesis.speak(utter);

Free, no API key, 50+ languages. Voice quality varies by platform.
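Voices load asynchronously in some browsers, so the find above can come up empty on first call. A small helper (the `pickVoice` name is this sketch's own) that prefers the platform's default voice for a language:

```javascript
// Pick a voice for a BCP-47 language prefix, preferring the platform default.
// Returns undefined if voices haven't loaded yet; retry after the
// 'voiceschanged' event fires on speechSynthesis.
function pickVoice(voices, lang = 'en') {
  return voices.find(v => v.default && v.lang.startsWith(lang))
      || voices.find(v => v.lang.startsWith(lang));
}
```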

2. Capture audio for analysis

SpeechSynthesis plays straight to the output device and exposes no audio stream to tap. To analyze the audio, use a cloud TTS that returns an audio buffer, or capture system/tab audio via screen-capture APIs (needs user permission).

// Option A (cloud TTS endpoint returns encoded audio, e.g. MP3/WAV):
const ctx = new AudioContext();
const buffer = await fetch('/tts', { method: 'POST', body: text }).then(r => r.arrayBuffer());
const audioBuffer = await ctx.decodeAudioData(buffer);
const source = ctx.createBufferSource();
source.buffer = audioBuffer;
const analyser = ctx.createAnalyser();
source.connect(analyser).connect(ctx.destination);  // connect() returns its argument, so chaining works
source.start();

3. Amplitude-driven jaw

Simplest: analyze overall amplitude, drive jaw-open morph target.

const data = new Uint8Array(analyser.fftSize);
function tick() {
  analyser.getByteTimeDomainData(data);  // waveform samples, centered on 128
  let sum = 0;
  for (const v of data) sum += Math.abs(v - 128);
  const level = sum / data.length / 128;  // mean amplitude, 0..1
  headMesh.morphTargetInfluences[JAW_OPEN] = Math.min(level * 2, 1);  // boost, clamp
  requestAnimationFrame(tick);
}
tick();
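The level computation is pure and worth testing on its own. Extracted as a hypothetical `byteLevel` helper:

```javascript
// Mean absolute deviation of unsigned 8-bit waveform samples from the 128
// midpoint, normalized to 0..1. Silence → 0, full-scale square wave → ~1.
function byteLevel(data) {
  let sum = 0;
  for (const v of data) sum += Math.abs(v - 128);
  return sum / data.length / 128;
}
```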

4. Viseme mapping (accurate)

Each phoneme maps to a mouth shape (viseme). English has ~14 visemes.

  • AA (hat) → open wide
  • EE (see) → flat smile
  • OO (ooze) → rounded
  • MM (lip close) → lips together
  • FF/VV → lower lip on upper teeth
  • …12 more.
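The grouping is many-to-one: several phonemes share each mouth shape, which is how ~40 English phonemes collapse to ~14 visemes. An illustrative (not standard) mapping fragment using ARPAbet-style phoneme names:

```javascript
// Illustrative phoneme → viseme grouping; the viseme codes here are this
// sketch's own, matching the shapes listed above, not a fixed standard.
const PHONEME_TO_VISEME = {
  'AA': 'AA', 'AE': 'AA', 'AH': 'AA',   // open vowels
  'IY': 'EE', 'IH': 'EE',               // spread vowels
  'UW': 'OO', 'OW': 'OO',               // rounded vowels
  'M': 'MM', 'B': 'MM', 'P': 'MM',      // bilabial closure
  'F': 'FF', 'V': 'FF',                 // labiodental
};
```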

5. Getting visemes

  • Azure TTS: returns viseme events with timings. Easy.
  • Amazon Polly: same — speech marks include visemes.
  • Rhubarb Lip Sync (offline): .wav → viseme JSON. Free.
  • OVR LipSync (Meta): real-time visemes from raw audio.
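Whichever tool you use, the output is a timed viseme track. Rhubarb, for example, emits JSON with a mouthCues array of { start, end, value } entries in seconds; a minimal lookup during playback might be:

```javascript
// Find the active viseme at time t (seconds) in a Rhubarb-style cue list.
// Cues are sorted and non-overlapping, so a linear scan is fine for short clips.
function visemeAt(mouthCues, t) {
  const cue = mouthCues.find(c => t >= c.start && t < c.end);
  return cue ? cue.value : 'X';  // 'X' is Rhubarb's rest/closed shape
}

const cues = [
  { start: 0.00, end: 0.15, value: 'X' },
  { start: 0.15, end: 0.30, value: 'B' },
];
```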

6. ARKit blendshapes

The iOS ARKit / MetaHuman standard: 52 blendshapes. Map each viseme to a weighted subset:

const VISEME_TO_BLENDS = {
  'AA': { jawOpen: 0.8 },
  'EE': { mouthSmileLeft: 0.5, mouthSmileRight: 0.5, jawOpen: 0.2 },
  'OO': { mouthPucker: 0.8, jawOpen: 0.3 },
  // ...
};
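Writing one of these entries onto a three.js mesh means resolving blendshape names to indices. Meshes whose morph targets are named (typical for glTF) expose morphTargetDictionary for exactly this; a sketch:

```javascript
// Write a viseme's blendshape weights onto a mesh. In three.js,
// morphTargetDictionary maps morph-target name → index into
// morphTargetInfluences; unknown names are skipped.
function applyViseme(mesh, blends) {
  for (const [name, weight] of Object.entries(blends)) {
    const i = mesh.morphTargetDictionary[name];
    if (i !== undefined) mesh.morphTargetInfluences[i] = weight;
  }
}
```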

7. Smooth transitions

Jumping between visemes looks robotic. Blend over ~80ms:

const lerp = (a, b, t) => a + (b - a) * t;
for (const blend in target) {
  current[blend] = lerp(current[blend] || 0, target[blend], 0.3);  // ~80ms at 60fps
  mesh.morphTargetInfluences[blendIndex[blend]] = current[blend];
}
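A fixed per-frame factor like 0.3 assumes a steady ~60 fps. A framerate-independent alternative is exponential smoothing with a time constant (tau = 0.08 matches the ~80 ms target):

```javascript
// Exponential smoothing: moves ~63% of the way to the target after `tau`
// seconds, regardless of how dt is sliced into frames.
function smooth(current, target, dt, tau = 0.08) {
  return current + (target - current) * (1 - Math.exp(-dt / tau));
}
```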

8. Live demo — amplitude-driven mouth

Type, click Speak. Procedural "mouth" (square) opens with audio amplitude via SpeechSynthesis.

9. Takeaways

  • Free: browser SpeechSynthesis.
  • Better: cloud TTS with viseme events (Azure, Polly).
  • Amplitude-driven = cheap mouth open/close.
  • Viseme-driven = realistic mouth shapes.
  • Smooth transitions between visemes via lerp.
  • ARKit 52 blendshapes + viseme map = Metahuman-tier talking head.