Here's the full inventory of AI features, models, and techniques powering Mirror Mirror end-to-end:
1. Face Detection & Landmark Tracking — MediaPipe Face Landmarker
- Model: `face_landmarker.task` (float16), loaded from `@mediapipe/tasks-vision`
- What it does: Detects 478 3D face landmarks in real time from the webcam. Used in two places:
- Capture phase — guides the user through 5 poses (center, left, right, up, down) and extracts landmark geometry + facial transformation matrices per pose.
- Gaze tracking — runs continuously during the viewer to track the user's head pose (pitch/yaw) and drive the 3D model rotation + expression deformation in real time.
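The head-pose part of gaze tracking is pure math on the transformation matrix. A minimal sketch, assuming a row-major 4×4 rigid transform (the matrix layout and the yaw-then-pitch angle convention are assumptions, not confirmed from the project):

```typescript
// Extract head pitch/yaw (radians) from a row-major 4x4 rigid transform,
// assuming the rotation block is approximately R = Ry(yaw) * Rx(pitch).
function headPose(m: ArrayLike<number>): { pitch: number; yaw: number } {
  // Rotation entries (row-major): r02 = m[2], r12 = m[6], r22 = m[10]
  const yaw = Math.atan2(m[2], m[10]);                 // rotation about Y
  const sinPitch = Math.min(1, Math.max(-1, -m[6]));   // clamp for safety
  const pitch = Math.asin(sinPitch);                   // rotation about X
  return { pitch, yaw };
}
```

The two angles then feed the model-rotation controller each frame.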
2. Monocular Depth Estimation — Depth Anything V2 Small
- Model: `depth-anything/Depth-Anything-V2-Small-hf` via HuggingFace Transformers + PyTorch
- What it does: Takes each captured JPEG frame, infers a per-pixel depth map, normalizes it to an 8-bit 256×256 grayscale PNG. This gives the 3D mesh actual Z-depth so the face isn't flat — vertices are displaced in Z by sampling this depth map.
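The 8-bit normalization step is plain array math. A minimal sketch (function name illustrative; Depth Anything produces relative depth, so whether larger values mean nearer or farther, and the resize to 256×256, are handled elsewhere):

```typescript
// Min-max normalize a float depth map to 8-bit grayscale (0..255).
function depthTo8bit(depth: Float32Array): Uint8Array {
  let min = Infinity, max = -Infinity;
  for (const d of depth) { if (d < min) min = d; if (d > max) max = d; }
  const out = new Uint8Array(depth.length);
  const range = max - min;
  if (range === 0) return out; // degenerate constant-depth map -> all zeros
  for (let i = 0; i < depth.length; i++) {
    out[i] = Math.round(((depth[i] - min) / range) * 255);
  }
  return out;
}
```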
3. 3D Face Reconstruction — Custom multi-pose fusion pipeline
- Technique: Combines MediaPipe's 478 landmark positions from the center pose with the depth map to produce a 3D vertex mesh. UV coordinates are computed from the face's image-space position, and a texture atlas is stitched from all 5 captured angles (center front panel + 4 side panels for left/right/up/down). Triangulation uses MediaPipe's canonical face topology (`FACE_TRIANGLES` + `EYE_FILL_TRIANGLES`).
- Math: Rigid-body 4×4 matrix inversion, matrix multiplication, and point transformation to un-project landmarks into 3D world space.
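The inversion can exploit the rigid-body structure: for M = [R | t], the inverse is [Rᵀ | −Rᵀt], which is cheaper and numerically safer than general 4×4 inversion. A sketch assuming row-major matrices and column-vector points (conventions are assumptions here):

```typescript
type Mat4 = number[]; // 16 entries, row-major; translation in m[3], m[7], m[11]

// Invert a rigid transform [R | t] as [R^T | -R^T t].
function invertRigid(m: Mat4): Mat4 {
  const r = [m[0], m[1], m[2], m[4], m[5], m[6], m[8], m[9], m[10]];
  const t = [m[3], m[7], m[11]];
  // Translation of the inverse: -R^T t
  const tx = -(r[0] * t[0] + r[3] * t[1] + r[6] * t[2]);
  const ty = -(r[1] * t[0] + r[4] * t[1] + r[7] * t[2]);
  const tz = -(r[2] * t[0] + r[5] * t[1] + r[8] * t[2]);
  return [
    r[0], r[3], r[6], tx,
    r[1], r[4], r[7], ty,
    r[2], r[5], r[8], tz,
    0, 0, 0, 1,
  ];
}

// Apply M to a 3D point (homogeneous w = 1).
function transformPoint(m: Mat4, p: [number, number, number]): [number, number, number] {
  return [
    m[0] * p[0] + m[1] * p[1] + m[2] * p[2] + m[3],
    m[4] * p[0] + m[5] * p[1] + m[6] * p[2] + m[7],
    m[8] * p[0] + m[9] * p[1] + m[10] * p[2] + m[11],
  ];
}
```

Un-projecting a landmark is then `transformPoint(invertRigid(faceMatrix), landmark)`.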
4. Real-time 3D Rendering — WebGPU
- Custom WGSL vertex/fragment shaders (`face.vert.wgsl`, `face.frag.wgsl`) rendering the textured face mesh with per-vertex normals and lighting.
- Supports dynamic vertex updates every frame for expression deformation.
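Per-vertex normals for the lighting can be accumulated on the CPU before each vertex-buffer upload. This is a generic smoothing technique, not necessarily the project's exact code: sum unnormalized face normals over the triangles sharing each vertex, then normalize.

```typescript
// Smooth per-vertex normals from xyz position triples and triangle indices.
function vertexNormals(positions: Float32Array, indices: number[]): Float32Array {
  const n = new Float32Array(positions.length);
  for (let i = 0; i < indices.length; i += 3) {
    const [a, b, c] = [indices[i] * 3, indices[i + 1] * 3, indices[i + 2] * 3];
    const e1 = [positions[b] - positions[a], positions[b + 1] - positions[a + 1], positions[b + 2] - positions[a + 2]];
    const e2 = [positions[c] - positions[a], positions[c + 1] - positions[a + 1], positions[c + 2] - positions[a + 2]];
    // Face normal = e1 x e2; its length weights larger triangles more.
    const fx = e1[1] * e2[2] - e1[2] * e2[1];
    const fy = e1[2] * e2[0] - e1[0] * e2[2];
    const fz = e1[0] * e2[1] - e1[1] * e2[0];
    for (const v of [a, b, c]) { n[v] += fx; n[v + 1] += fy; n[v + 2] += fz; }
  }
  for (let v = 0; v < n.length; v += 3) {
    const len = Math.hypot(n[v], n[v + 1], n[v + 2]) || 1;
    n[v] /= len; n[v + 1] /= len; n[v + 2] /= len;
  }
  return n;
}
```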
5. Expression Deformation — Procrustes Alignment + Landmark Deltas
- Technique: 2D Procrustes analysis (closed-form similarity transform: translate + rotate + uniform scale) aligns live landmarks to rest-pose landmarks using 8 bony anchor points. Per-landmark deltas are computed, EMA-smoothed (α=0.35), then converted to vertex-space displacements. This drives real-time facial expressions on the 3D model from webcam input.
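The closed-form 2D similarity fit has a compact solution via centroid removal plus summed dot and cross products. A sketch (the anchor-point selection is omitted, and whether α weights the new or the previous sample is an assumption; here α = 0.35 weights the new sample):

```typescript
type Pt = [number, number];

// Closed-form 2D Procrustes: the similarity transform (scale s, rotation
// theta, translation) that best maps src onto dst in least squares.
function fitSimilarity(src: Pt[], dst: Pt[]) {
  const n = src.length;
  const mean = (ps: Pt[]): Pt => {
    let x = 0, y = 0;
    for (const p of ps) { x += p[0]; y += p[1]; }
    return [x / n, y / n];
  };
  const ms = mean(src), md = mean(dst);
  let a = 0, b = 0, norm = 0;
  for (let i = 0; i < n; i++) {
    const px = src[i][0] - ms[0], py = src[i][1] - ms[1];
    const qx = dst[i][0] - md[0], qy = dst[i][1] - md[1];
    a += px * qx + py * qy;  // summed dot products
    b += px * qy - py * qx;  // summed 2D cross products
    norm += px * px + py * py;
  }
  const theta = Math.atan2(b, a);
  const s = Math.hypot(a, b) / norm;
  const [cos, sin] = [Math.cos(theta), Math.sin(theta)];
  const apply = (p: Pt): Pt => [
    s * (cos * (p[0] - ms[0]) - sin * (p[1] - ms[1])) + md[0],
    s * (sin * (p[0] - ms[0]) + cos * (p[1] - ms[1])) + md[1],
  ];
  return { s, theta, apply };
}

// EMA smoothing of per-landmark deltas (alpha weights the new sample).
function emaSmooth(prev: Pt[], next: Pt[], alpha = 0.35): Pt[] {
  return next.map((p, i): Pt => [
    alpha * p[0] + (1 - alpha) * prev[i][0],
    alpha * p[1] + (1 - alpha) * prev[i][1],
  ]);
}
```

Aligning live landmarks with `fitSimilarity(live, rest).apply(...)` removes head motion, so the remaining per-landmark deltas isolate expression.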
6. Voice Cloning — ElevenLabs Voice Add API
- API: `POST /v1/voices/add`
- What it does: During calibration, the user speaks an incantation. The recorded audio is uploaded to ElevenLabs, which creates a cloned voice profile and returns a `voice_id`. This voice is then used for all TTS responses — the mirror literally speaks back in your voice.
7. Text-to-Speech with Timestamps — ElevenLabs Multilingual V2
- Model: `eleven_multilingual_v2`
- API: `POST /v1/text-to-speech/{voice_id}/with-timestamps`
- What it does: Synthesizes the mirror's response text into audio using the cloned voice, and returns character-level start/end timestamps. These timestamps are critical for lip-sync.
8. Speech-to-Text — ElevenLabs Scribe V1
- Model: `scribe_v1`
- API: `POST /v1/speech-to-text`
- What it does: Transcribes the user's spoken question into text so it can be sent to the LLM.
9. Conversational AI — Claude (Anthropic)
- Model: `claude-opus-4-6`
- API: Anthropic Messages API (`POST /v1/messages`)
- System prompt: "You are the reflection in this mirror, speaking back to the face looking at you. Respond in one short sentence. Be direct, honest, and a little surprising — like something your own face might actually say."
- What it does: Generates the mirror's spoken response. Maintains multi-turn conversation history so the mirror has context.
10. Viseme-based Lip Sync — Custom phoneme→viseme pipeline
- Technique: Rule-based English text-to-phoneme conversion (digraph + single-char rules → 13 phoneme classes: REST, MBP, FV, TH, TDNL, SZ, SH, R, KG, W, OH, AH, EE). Each phoneme maps to a viseme (mouth shape). During calibration, the incantation's known phoneme sequence is time-aligned to the captured landmark frames, extracting a representative deformation delta per viseme. At playback, ElevenLabs' character timestamps are merged with the phoneme sequence to produce a timed viseme timeline, and the face mesh is deformed frame-by-frame with lerp blending (t=0.7) to animate the mouth in sync with audio.
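The merge step above can be sketched compactly. The rule table here is a small illustrative subset standing in for the project's full digraph/single-char rules, and `charTimes` mimics ElevenLabs' per-character start/end pairs:

```typescript
// Rule-based text -> viseme spans (digraphs checked before single chars).
const DIGRAPHS: Record<string, string> = { th: "TH", sh: "SH", ch: "SH" };
const SINGLES: Record<string, string> = {
  m: "MBP", b: "MBP", p: "MBP", f: "FV", v: "FV", s: "SZ", z: "SZ",
  r: "R", w: "W", o: "OH", a: "AH", e: "EE", i: "EE",
  t: "TDNL", d: "TDNL", n: "TDNL", l: "TDNL", k: "KG", g: "KG",
};

interface VisemeSpan { viseme: string; i0: number; i1: number }

function textToVisemes(text: string): VisemeSpan[] {
  const out: VisemeSpan[] = [];
  const lower = text.toLowerCase();
  for (let i = 0; i < lower.length; ) {
    const pair = lower.slice(i, i + 2);
    if (DIGRAPHS[pair]) { out.push({ viseme: DIGRAPHS[pair], i0: i, i1: i + 1 }); i += 2; }
    else { out.push({ viseme: SINGLES[lower[i]] ?? "REST", i0: i, i1: i }); i += 1; }
  }
  return out;
}

// Merge with per-character [start, end] timestamps (seconds) from the TTS
// response to produce a timed viseme timeline for playback.
function visemeTimeline(text: string, charTimes: [number, number][]) {
  return textToVisemes(text).map((p) => ({
    viseme: p.viseme,
    start: charTimes[p.i0][0],
    end: charTimes[p.i1][1],
  }));
}
```

At playback, the active timeline entry selects the captured per-viseme deformation shape, and the mesh is blended toward it each frame with `lerp(current, target, 0.7)`.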
11. Voice Activity Detection (VAD) — Web Audio API + RMS threshold
- Technique: An `AnalyserNode` computes RMS energy from the microphone stream in real time. Once speech is detected (RMS > 0.01 threshold), a 1.5s silence timer triggers automatic recording stop. This gives natural conversational turn-taking — the mirror knows when you've stopped speaking.
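The gate reduces to a small state machine over RMS frames. A sketch using the thresholds above (names illustrative; in the app the samples would come from `AnalyserNode.getFloatTimeDomainData`):

```typescript
// Root-mean-square energy of one audio frame.
function rms(samples: Float32Array): number {
  let sum = 0;
  for (const s of samples) sum += s * s;
  return Math.sqrt(sum / samples.length);
}

// Turn-taking gate: once RMS crosses the threshold, 1.5 s of continuous
// silence ends the utterance.
class VadGate {
  private speaking = false;
  private lastVoiceAt = 0;
  constructor(private threshold = 0.01, private silenceSec = 1.5) {}
  update(frameRms: number, tSec: number): "idle" | "speaking" | "stopped" {
    if (frameRms > this.threshold) { this.speaking = true; this.lastVoiceAt = tSec; }
    if (!this.speaking) return "idle";
    return tSec - this.lastVoiceAt >= this.silenceSec ? "stopped" : "speaking";
  }
}
```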
12. Motion Control — Multi-source input fusion
- Technique: Fuses gaze tracking (highest priority), device orientation (gyroscope), and mouse/touch drag to control the 3D model's rotation. Uses exponential smoothing (`1 - e^(-speed·dt)`) for fluid interpolation, with per-source lerp speeds and priority arbitration.
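The point of that smoothing factor is frame-rate independence: covering the remaining distance over two half-length frames lands exactly where one full frame would. A minimal sketch of the interpolation (the arbitration itself just selects the highest-priority source currently producing input):

```typescript
// Fraction of the remaining distance to cover this frame for a source
// with the given lerp speed; exact under variable frame times.
function smoothingFactor(speed: number, dt: number): number {
  return 1 - Math.exp(-speed * dt);
}

function smoothTowards(current: number, target: number, speed: number, dt: number): number {
  return current + (target - current) * smoothingFactor(speed, dt);
}
```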
13. Persistence — IndexedDB
- Scans (landmark bundles, JPEG textures, depth data) and voice data (cloned voice IDs, per-viseme deformation shapes) are persisted client-side in IndexedDB, with schema migrations across 3 versions.
In summary: The user looks into their camera. MediaPipe tracks their face. Depth Anything gives it depth. WebGPU renders a 3D replica. They speak an incantation — ElevenLabs clones their voice while Procrustes extracts their mouth shapes. They ask a question — Scribe transcribes it, Claude answers, ElevenLabs speaks the answer in their own cloned voice, and the 3D face lip-syncs using the captured viseme shapes. The mirror talks back.