Three.js From Zero · Season 10 · Article 05
TTS + Lip Sync
The browser speaks text. The character's mouth moves with it. SpeechSynthesis API + amplitude analysis + morph target driving. A talking head in ~30 lines.
1. Browser speech synthesis
// Voices load asynchronously in some browsers; getVoices() may be empty at first.
const utter = new SpeechSynthesisUtterance("Hello world");
utter.rate = 1.0;   // 0.1..10, default 1
utter.pitch = 1.0;  // 0..2, default 1
utter.voice = speechSynthesis.getVoices().find(v => v.lang.startsWith('en')) ?? null;
speechSynthesis.speak(utter);
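If the voice list comes back empty, wait for it. A small helper (a sketch; Chrome needs the voiceschanged event, some platforms don't):
function loadVoices() {
  return new Promise(resolve => {
    const voices = speechSynthesis.getVoices();
    if (voices.length) return resolve(voices);
    speechSynthesis.addEventListener('voiceschanged',
      () => resolve(speechSynthesis.getVoices()), { once: true });
  });
}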
Free, no API key, dozens of languages depending on the platform. Voice quality varies by platform too.
2. Capture audio for analysis
SpeechSynthesis plays straight to the output and exposes no samples to analyze. You either capture the audio some other way or use a cloud TTS that returns an audio buffer, then run it through Web Audio.
// Option A: a cloud TTS endpoint returns raw audio bytes we can decode and tap.
const ctx = new AudioContext();
const buffer = await fetch('/tts', { method: 'POST', body: text }).then(r => r.arrayBuffer());
const audioBuffer = await ctx.decodeAudioData(buffer);
const source = ctx.createBufferSource();
source.buffer = audioBuffer;
const analyser = ctx.createAnalyser();
source.connect(analyser).connect(ctx.destination); // tap for analysis, then play
source.start();
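Option B, sketched as an assumption about your setup: if the TTS audio already exists as a file or URL, an audio element plus createMediaElementSource gives the same analyser tap:
// Option B: pre-generated TTS file (the '/tts.mp3' path is hypothetical).
const audio = new Audio('/tts.mp3');
const mediaSrc = ctx.createMediaElementSource(audio);
const analyser = ctx.createAnalyser();
mediaSrc.connect(analyser).connect(ctx.destination);
audio.play();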
3. Amplitude-driven jaw
Simplest: analyze overall amplitude, drive jaw-open morph target.
const data = new Uint8Array(analyser.fftSize);
function tick() {
  requestAnimationFrame(tick);
  analyser.getByteTimeDomainData(data);           // waveform, centered at 128
  let sum = 0;
  for (const v of data) sum += Math.abs(v - 128); // deviation from silence
  const level = sum / data.length / 128;          // mean amplitude, 0..1
  headMesh.morphTargetInfluences[JAW_OPEN] = Math.min(level * 2, 1); // boost quiet speech, clamp
}
tick();
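Raw per-frame amplitude flickers. A fast-attack, slow-release envelope steadies the jaw; a sketch (the 0.9 release factor is a guess to tune):
let env = 0;
function envelope(level) {
  env = level > env ? level : env * 0.9; // open instantly, close gradually
  return env;
}
// in tick(): ...morphTargetInfluences[JAW_OPEN] = Math.min(envelope(level) * 2, 1);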
4. Viseme mapping (accurate)
Each phoneme maps to a mouth shape (viseme). English has ~14 visemes.
- AA (hat) → open wide
- EE (see) → flat smile
- OO (ooze) → rounded
- MM (lip close) → lips together
- FF/VV → lower lip on upper teeth
- …and ~9 more.
5. Getting visemes
- Azure TTS: returns viseme events with timings. Easy.
- Amazon Polly: same — speech marks include visemes.
- Rhubarb Lip Sync (offline): .wav → viseme JSON. Free.
- OVR LipSync: real-time from audio. Meta.
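Whichever tool you pick, the output boils down to timestamped viseme events. A playback sketch, assuming events shaped like { time, viseme } (field names invented here; adapt to your tool's JSON) and applyViseme() from section 6:
function playVisemes(events, ctx, startedAt) {
  // events: [{ time: seconds, viseme: 'AA' }, ...] sorted by time (assumed shape).
  // Call with startedAt = ctx.currentTime captured right after source.start().
  let i = 0;
  (function tick() {
    const t = ctx.currentTime - startedAt; // seconds into playback
    while (i < events.length && events[i].time <= t) applyViseme(events[i++].viseme);
    if (i < events.length) requestAnimationFrame(tick);
  })();
}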
6. ARKit blendshapes
iOS / MetaHuman standard: 52 blendshapes. Map each viseme onto a subset:
const VISEME_TO_BLENDS = {
  'AA': { jawOpen: 0.8, mouthOpen: 0.6 },
  'EE': { mouthSmileLeft: 0.5, mouthSmileRight: 0.5, jawOpen: 0.2 },
  'OO': { mouthPucker: 0.8, jawOpen: 0.3 },
  // ...
};
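Applying an entry to a three.js mesh: a minimal sketch, assuming the model's morph targets carry these names so mesh.morphTargetDictionary can resolve them to indices:
function applyViseme(viseme) {
  const target = VISEME_TO_BLENDS[viseme] ?? {};
  for (const [name, weight] of Object.entries(target)) {
    const idx = headMesh.morphTargetDictionary[name]; // name -> influence index
    if (idx !== undefined) headMesh.morphTargetInfluences[idx] = weight;
  }
}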
7. Smooth transitions
Jumping between visemes looks robotic. Blend over ~80ms:
const lerp = (a, b, t) => a + (b - a) * t;
const current = {};                            // last applied weight per blendshape
const blendIndex = mesh.morphTargetDictionary; // name -> morph target index
for (const blend in target) {                  // target = VISEME_TO_BLENDS[viseme], run per frame
  current[blend] = lerp(current[blend] || 0, target[blend], 0.3); // 0.3/frame ≈ the ~80 ms blend at 60 fps
  mesh.morphTargetInfluences[blendIndex[blend]] = current[blend];
  // (blends missing from target should be lerped toward 0 the same way)
}
8. Live demo — amplitude-driven mouth
Type text, click Speak. A procedural "mouth" (a square) opens and closes while SpeechSynthesis plays; since the API exposes no samples, the amplitude is approximated.
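A sketch of how such a demo can fake the level (an assumption about the demo's internals; 'mouth' and 'input' are hypothetical names): drive a noisy oscillation while the utterance is speaking:
const utter = new SpeechSynthesisUtterance(input.value);
let speaking = false;
utter.onstart = () => { speaking = true; };
utter.onend = () => { speaking = false; };
speechSynthesis.speak(utter);

(function animate(t = 0) {
  requestAnimationFrame(animate);
  // Pseudo-amplitude: noisy sine while speaking, closed otherwise.
  const level = speaking ? Math.abs(Math.sin(t * 0.02)) * (0.5 + 0.5 * Math.random()) : 0;
  mouth.scale.y = 0.1 + level; // 'mouth' = the demo's square mesh (hypothetical)
})();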
9. Takeaways
- Free: browser SpeechSynthesis.
- Better: cloud TTS with viseme events (Azure, Polly).
- Amplitude-driven = cheap mouth open/close.
- Viseme-driven = realistic mouth shapes.
- Smooth transitions between visemes via lerp.
- ARKit 52 blendshapes + viseme map = MetaHuman-tier talking head.