AI Features and Models Powering Mirror Mirror

Please describe the AI features, models, and techniques at play end-to-end to make Mirror Mirror happen.

17 days ago

Here's the full inventory of AI features, models, and techniques powering Mirror Mirror end-to-end:


1. Face Detection & Landmark Tracking — MediaPipe Face Landmarker

  • Model: face_landmarker.task (float16), loaded from @mediapipe/tasks-vision
  • What it does: Detects 478 3D face landmarks in real time from the webcam. Used in two places:
    • Capture phase — guides the user through 5 poses (center, left, right, up, down) and extracts landmark geometry + facial transformation matrices per pose.
    • Gaze tracking — runs continuously during the viewer to track the user's head pose (pitch/yaw) and drive the 3D model rotation + expression deformation in real time.
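
As a rough illustration of the gaze-tracking math, here is how pitch/yaw can be pulled out of a 4×4 facial transformation matrix. This is a generic ZYX (yaw-pitch-roll) Euler extraction in Python; the actual implementation runs in the browser and its angle conventions may differ:

```python
import math

def head_pose(m):
    """Extract pitch (nod, about X) and yaw (turn, about Y) in degrees from a
    row-major 4x4 rigid transform, assuming a ZYX rotation order."""
    yaw = math.asin(max(-1.0, min(1.0, -m[2][0])))
    pitch = math.atan2(m[2][1], m[2][2])
    return math.degrees(pitch), math.degrees(yaw)
```

These two angles are what drive both the model rotation and the expression deformation described above.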

2. Monocular Depth Estimation — Depth Anything V2 Small

  • Model: depth-anything/Depth-Anything-V2-Small-hf via HuggingFace Transformers + PyTorch
  • What it does: Takes each captured JPEG frame, infers a per-pixel depth map, normalizes it to an 8-bit 256×256 grayscale PNG. This gives the 3D mesh actual Z-depth so the face isn't flat — vertices are displaced in Z by sampling this depth map.
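
The normalization step is plain min-max scaling; a minimal pure-Python sketch (resizing to 256×256 and PNG encoding omitted):

```python
def normalize_depth(depth):
    """Min-max scale a 2D float depth map to 8-bit grayscale values (0-255)."""
    lo = min(min(row) for row in depth)
    hi = max(max(row) for row in depth)
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return [[round((d - lo) * scale) for d in row] for row in depth]
```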

3. 3D Face Reconstruction — Custom multi-pose fusion pipeline

  • Technique: Combines MediaPipe's 478 landmark positions from the center pose with the depth map to produce a 3D vertex mesh. UV coordinates are computed from the face's image-space position, and a texture atlas is stitched from all 5 captured angles (center front panel + 4 side panels for left/right/up/down). Triangulation uses MediaPipe's canonical face topology (FACE_TRIANGLES + EYE_FILL_TRIANGLES).
  • Math: Rigid-body 4×4 matrix inversion, matrix multiplication, and point transformation to un-project landmarks into 3D world space.
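
The un-projection math reduces to inverting a rigid 4×4 and applying it to landmark points. A sketch, assuming the transform is purely rigid (rotation + translation, no scale), so the inverse has the closed form [Rᵀ | -Rᵀt]:

```python
def invert_rigid(m):
    """Invert a row-major 4x4 rigid transform: inverse = [R^T | -R^T t]."""
    rt = [[m[j][i] for j in range(3)] for i in range(3)]  # rotation transposed
    t = [m[0][3], m[1][3], m[2][3]]
    ti = [-(rt[i][0]*t[0] + rt[i][1]*t[1] + rt[i][2]*t[2]) for i in range(3)]
    return [rt[0] + [ti[0]], rt[1] + [ti[1]], rt[2] + [ti[2]],
            [0.0, 0.0, 0.0, 1.0]]

def transform_point(m, p):
    """Apply a 4x4 transform to a 3D point (homogeneous w = 1)."""
    return [m[i][0]*p[0] + m[i][1]*p[1] + m[i][2]*p[2] + m[i][3]
            for i in range(3)]
```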

4. Real-time 3D Rendering — WebGPU

  • Custom WGSL vertex/fragment shaders (face.vert.wgsl, face.frag.wgsl) rendering the textured face mesh with per-vertex normals and lighting.
  • Supports dynamic vertex updates every frame for expression deformation.
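
The WGSL sources aren't reproduced here, but the lighting model described (per-vertex normals plus simple diffuse shading) can be illustrated in Python; these are hypothetical helper names, not the project's actual shader code:

```python
import math

def face_normal(v0, v1, v2):
    """Unit normal of a triangle via the cross product of two edge vectors."""
    e1 = [v1[i] - v0[i] for i in range(3)]
    e2 = [v2[i] - v0[i] for i in range(3)]
    n = [e1[1]*e2[2] - e1[2]*e2[1],
         e1[2]*e2[0] - e1[0]*e2[2],
         e1[0]*e2[1] - e1[1]*e2[0]]
    length = math.sqrt(sum(c * c for c in n)) or 1.0
    return [c / length for c in n]

def lambert(normal, light_dir):
    """Diffuse term max(0, n . l), as a fragment shader would compute it."""
    return max(0.0, sum(n * l for n, l in zip(normal, light_dir)))
```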

5. Expression Deformation — Procrustes Alignment + Landmark Deltas

  • Technique: 2D Procrustes analysis (closed-form similarity transform: translate + rotate + uniform scale) aligns live landmarks to rest-pose landmarks using 8 bony anchor points. Per-landmark deltas are computed, EMA-smoothed (α=0.35), then converted to vertex-space displacements. This drives real-time facial expressions on the 3D model from webcam input.
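
In 2D, the closed-form similarity transform has a neat complex-number formulation; a minimal sketch of the alignment plus the EMA smoothing (α = 0.35), treating each landmark as a complex number x + yj:

```python
def fit_similarity(src, dst):
    """Closed-form 2D Procrustes: least-squares translate + rotate + uniform
    scale mapping src onto dst. Points are complex numbers."""
    cs = sum(src) / len(src)
    cd = sum(dst) / len(dst)
    num = sum((p - cs).conjugate() * (q - cd) for p, q in zip(src, dst))
    den = sum(abs(p - cs) ** 2 for p in src)
    a = num / den  # complex scalar encoding scale * e^(i * rotation)
    return lambda z: a * (z - cs) + cd

def ema(prev, new, alpha=0.35):
    """Exponential moving average used to smooth per-landmark deltas."""
    return alpha * new + (1 - alpha) * prev
```

In the pipeline, the transform fit on the bony anchor points is applied to all live landmarks before deltas are taken, so global head motion doesn't register as expression.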

6. Voice Cloning — ElevenLabs Voice Add API

  • API: POST /v1/voices/add
  • What it does: During calibration, the user speaks an incantation. The recorded audio is uploaded to ElevenLabs, which creates a cloned voice profile and returns a voice_id. This voice is then used for all TTS responses — the mirror literally speaks back in your voice.

7. Text-to-Speech with Timestamps — ElevenLabs Multilingual V2

  • Model: eleven_multilingual_v2
  • API: POST /v1/text-to-speech/{voice_id}/with-timestamps
  • What it does: Synthesizes the mirror's response text into audio using the cloned voice, and returns character-level start/end timestamps. These timestamps are critical for lip-sync.

8. Speech-to-Text — ElevenLabs Scribe V1

  • Model: scribe_v1
  • API: POST /v1/speech-to-text
  • What it does: Transcribes the user's spoken question into text so it can be sent to the LLM.

9. Conversational AI — Claude (Anthropic)

  • Model: claude-opus-4-6
  • API: Anthropic Messages API (POST /v1/messages)
  • System prompt: "You are the reflection in this mirror, speaking back to the face looking at you. Respond in one short sentence. Be direct, honest, and a little surprising — like something your own face might actually say."
  • What it does: Generates the mirror's spoken response. Maintains multi-turn conversation history so the mirror has context.
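
A sketch of how the multi-turn request body might be assembled. Field names follow Anthropic's public Messages API; the max_tokens value is illustrative, not taken from the project:

```python
SYSTEM_PROMPT = ("You are the reflection in this mirror, speaking back to the "
                 "face looking at you. Respond in one short sentence. Be direct, "
                 "honest, and a little surprising — like something your own face "
                 "might actually say.")

def build_request(history, user_text, model="claude-opus-4-6"):
    """Append the new user turn to prior turns and build a Messages API body.
    `history` is a list of {"role": ..., "content": ...} dicts."""
    messages = history + [{"role": "user", "content": user_text}]
    return {"model": model, "system": SYSTEM_PROMPT,
            "max_tokens": 100, "messages": messages}
```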

10. Viseme-based Lip Sync — Custom phoneme→viseme pipeline

  • Technique: Rule-based English text-to-phoneme conversion (digraph + single-char rules → 13 phoneme classes: REST, MBP, FV, TH, TDNL, SZ, SH, R, KG, W, OH, AH, EE), with each phoneme mapping to a viseme (mouth shape). Used in two phases:
    • Calibration — the incantation's known phoneme sequence is time-aligned to the captured landmark frames, extracting a representative deformation delta per viseme.
    • Playback — ElevenLabs' character timestamps are merged with the phoneme sequence to produce a timed viseme timeline, and the face mesh is deformed frame-by-frame with lerp blending (t=0.7) to animate the mouth in sync with the audio.
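
A toy version of the playback-side merge. Only a handful of the 13 phoneme classes are shown, and this rule table is invented for illustration; the class names come from the list above:

```python
DIGRAPHS = {"sh": "SH", "ch": "SH", "th": "TH"}
SINGLES = {"m": "MBP", "b": "MBP", "p": "MBP", "f": "FV", "v": "FV",
           "s": "SZ", "z": "SZ", "r": "R", "w": "W",
           "a": "AH", "e": "EE", "i": "EE", "o": "OH"}

def viseme_timeline(text, starts, ends):
    """Merge per-character timestamps (as returned by ElevenLabs) with
    digraph/single-char phoneme rules into (viseme, start, end) entries."""
    timeline, i, t = [], 0, text.lower()
    while i < len(t):
        pair = t[i:i + 2]
        if pair in DIGRAPHS:
            # a digraph spans two characters: start of first, end of second
            timeline.append((DIGRAPHS[pair], starts[i], ends[i + 1]))
            i += 2
        else:
            timeline.append((SINGLES.get(t[i], "REST"), starts[i], ends[i]))
            i += 1
    return timeline

def lerp_blend(prev, target, t=0.7):
    """Per-frame blending of the mouth shape toward the active viseme delta."""
    return [p + t * (q - p) for p, q in zip(prev, target)]
```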

11. Voice Activity Detection (VAD) — Web Audio API + RMS threshold

  • Technique: An AnalyserNode computes RMS energy from the microphone stream in real time. Once speech is detected (RMS > 0.01 threshold), a 1.5s silence timer triggers automatic recording stop. This gives natural conversational turn-taking — the mirror knows when you've stopped speaking.
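
The turn-taking logic is a small state machine over RMS energy; a sketch using the threshold and silence window stated above:

```python
import math

def rms(samples):
    """Root-mean-square energy of an audio buffer (floats in [-1, 1])."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

class VoiceActivityDetector:
    def __init__(self, threshold=0.01, silence_s=1.5):
        self.threshold = threshold
        self.silence_s = silence_s
        self.speaking = False
        self.silent_since = None

    def update(self, samples, now):
        """Feed one analyser buffer; returns True when recording should stop
        (speech was heard earlier, then silence held for silence_s seconds)."""
        if rms(samples) > self.threshold:
            self.speaking = True
            self.silent_since = None
        elif self.speaking:
            if self.silent_since is None:
                self.silent_since = now
            elif now - self.silent_since >= self.silence_s:
                return True
        return False
```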

12. Motion Control — Multi-source input fusion

  • Technique: Fuses gaze tracking (highest priority), device orientation (gyroscope), and mouse/touch drag to control the 3D model's rotation. Uses exponential smoothing (1 - e^(-speed·dt)) for fluid interpolation, with per-source lerp speeds and priority arbitration.
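
The frame-rate-independent smoothing factor is the standard exponential form; a sketch:

```python
import math

def smooth_toward(current, target, speed, dt):
    """Move `current` toward `target` by the factor 1 - e^(-speed*dt), which
    converges identically regardless of frame rate (dt in seconds)."""
    return current + (target - current) * (1.0 - math.exp(-speed * dt))
```

A useful property of this form: two half-steps compose to exactly one full step, so varying dt between frames never causes jitter.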

13. Persistence — IndexedDB

  • Scans (landmark bundles, JPEG textures, depth data) and voice data (cloned voice IDs, per-viseme deformation shapes) are persisted client-side in IndexedDB, with schema migrations across 3 versions.

In summary: The user looks into their camera. MediaPipe tracks their face. Depth Anything gives it depth. WebGPU renders a 3D replica. They speak an incantation — ElevenLabs clones their voice while Procrustes extracts their mouth shapes. They ask a question — Scribe transcribes it, Claude answers, ElevenLabs speaks the answer in their own cloned voice, and the 3D face lip-syncs using the captured viseme shapes. The mirror talks back.