AI Features and Models Powering Mirror Mirror

Please describe the list of AI features and models and techniques that are at play from end-to-end to make mirror mirror happen.

Here's the full inventory of AI features, models, and techniques powering Mirror Mirror end-to-end:


1. Face Detection & Landmark Tracking — MediaPipe Face Landmarker

  • Model: face_landmarker.task (float16), loaded from @mediapipe/tasks-vision
  • What it does: Detects 478 3D face landmarks in real-time from the webcam. Used in two places:
    • Capture phase — guides the user through 5 poses (center, left, right, up, down) and extracts landmark geometry + facial transformation matrices per pose.
    • Gaze tracking — runs continuously during the viewer to track the user's head pose (pitch/yaw) and drive the 3D model rotation + expression deformation in real time.

2. Monocular Depth Estimation — Depth Anything V2 Small

  • Model: depth-anything/Depth-Anything-V2-Small-hf via HuggingFace Transformers + PyTorch
  • What it does: Takes each captured JPEG frame, infers a per-pixel depth map, normalizes it to an 8-bit 256×256 grayscale PNG. This gives the 3D mesh actual Z-depth so the face isn't flat — vertices are displaced in Z by sampling this depth map.

3. 3D Face Reconstruction — Custom multi-pose fusion pipeline

  • Technique: Combines MediaPipe's 474 landmark positions from the center pose with the depth map to produce a 3D vertex mesh. UV coordinates are computed from the face's image-space position, and a texture atlas is stitched from all 5 captured angles (center front panel + 4 side panels for left/right/up/down). Triangulation uses MediaPipe's canonical face topology (FACE_TRIANGLES + EYE_FILL_TRIANGLES).
  • Math: Rigid-body 4×4 matrix inversion, matrix multiplication, and point transformation to un-project landmarks into 3D world space.

4. Real-time 3D Rendering — WebGPU

  • Custom WGSL vertex/fragment shaders (face.vert.wgsl, face.frag.wgsl) rendering the textured face mesh with per-vertex normals and lighting.
  • Supports dynamic vertex updates every frame for expression deformation.

5. Expression Deformation — Procrustes Alignment + Landmark Deltas

  • Technique: 2D Procrustes analysis (closed-form similarity transform: translate + rotate + uniform scale) aligns live landmarks to rest-pose landmarks using 8 bony anchor points. Per-landmark deltas are computed, EMA-smoothed (α=0.35), then converted to vertex-space displacements. This drives real-time facial expressions on the 3D model from webcam input.

6. Voice Cloning — ElevenLabs Voice Add API

  • API: POST /v1/voices/add
  • What it does: During calibration, the user speaks an incantation. The recorded audio is uploaded to ElevenLabs, which creates a cloned voice profile and returns a voice_id. This voice is then used for all TTS responses — the mirror literally speaks back in your voice.

7. Text-to-Speech with Timestamps — ElevenLabs Multilingual V2

  • Model: eleven_multilingual_v2
  • API: POST /v1/text-to-speech/{voice_id}/with-timestamps
  • What it does: Synthesizes the mirror's response text into audio using the cloned voice, and returns character-level start/end timestamps. These timestamps are critical for lip-sync.

8. Speech-to-Text — ElevenLabs Scribe V1

  • Model: scribe_v1
  • API: POST /v1/speech-to-text
  • What it does: Transcribes the user's spoken question into text so it can be sent to the LLM.

9. Conversational AI — Claude (Anthropic)

  • Model: claude-opus-4-6
  • API: Anthropic Messages API (POST /v1/messages)
  • System prompt: "You are the reflection in this mirror, speaking back to the face looking at you. Respond in one short sentence. Be direct, honest, and a little surprising — like something your own face might actually say."
  • What it does: Generates the mirror's spoken response. Maintains multi-turn conversation history so the mirror has context.

10. Viseme-based Lip Sync — Custom phoneme→viseme pipeline

  • Technique: Rule-based English text-to-phoneme conversion (digraph + single-char rules → 13 phoneme classes: REST, MBP, FV, TH, TDNL, SZ, SH, R, KG, W, OH, AH, EE). Each phoneme maps to a viseme (mouth shape). During calibration, the incantation's known phoneme sequence is time-aligned to the captured landmark frames, extracting a representative deformation delta per viseme. At playback, ElevenLabs' character timestamps are merged with the phoneme sequence to produce a timed viseme timeline, and the face mesh is deformed frame-by-frame with lerp blending (t=0.7) to animate the mouth in sync with audio.

11. Voice Activity Detection (VAD) — Web Audio API + RMS threshold

  • Technique: AnalyserNode computes RMS energy from the microphone stream in real-time. Once speech is detected (RMS > 0.01 threshold), a 1.5s silence timer triggers automatic recording stop. This gives natural conversational turn-taking — the mirror knows when you've stopped speaking.

12. Motion Control — Multi-source input fusion

  • Technique: Fuses gaze tracking (highest priority), device orientation (gyroscope), and mouse/touch drag to control the 3D model's rotation. Uses exponential smoothing (1 - e^(-speed·dt)) for fluid interpolation, with per-source lerp speeds and priority arbitration.

13. Persistence — IndexedDB

  • Scans (landmark bundles, JPEG textures, depth data) and voice data (cloned voice IDs, per-viseme deformation shapes) are persisted client-side in IndexedDB, with schema migrations across 3 versions.

In summary: The user looks into their camera. MediaPipe tracks their face. Depth Anything gives it depth. WebGPU renders a 3D replica. They speak an incantation — ElevenLabs clones their voice while Procrustes extracts their mouth shapes. They ask a question — Scribe transcribes it, Claude answers, ElevenLabs speaks the answer in their own cloned voice, and the 3D face lip-syncs using the captured viseme shapes. The mirror talks back.

Zozo:anthropic/claude-opus-4-6
10w