A face for every bawt.
The 3D avatar surface is BawtHub's most experimental feature — three model formats, two skeleton conventions, one shared scene, lip sync driven by audio amplitude instead of phonemes, and a bone-retargeting layer so a Mixamo dance clip lands cleanly on a VRM rig from a different vendor. It's labeled "experimental" in the README, but it's been load-bearing in the voice surface for over a year.
01 Three formats, one component.
AvatarModel.tsx dispatches to a per-format loader shell based on file extension:
| Format | Loader | Origin | Lip-sync path |
|---|---|---|---|
| .vrm | GLTFLoader + VRMLoaderPlugin | VRoid, ChatVRM exports, Booth marketplace | VRM expression Aa if present, morph fallback otherwise |
| .glb | GLTFLoader | Sketchfab, Mixamo + custom rigs | Shared morph names across child meshes |
| .fbx | FBXLoader | Mixamo characters | Single-mesh morph target |
Each shell loads its asset with useLoader, then funnels into three shared hooks: useModelSetup (mesh/morph/material/bone discovery and overrides), useAnimations (FBX clip loading + retarget), and useLipSync (the 60 fps frame loop that drives mouth, blink, breath, and head sway). The three loaders are different by necessity — VRM has a normalized humanoid skeleton, FBX has Mixamo's nonstandard naming, GLB is a wild west — but the hooks all collapse onto the same interface.
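The extension-based dispatch can be pictured with a small sketch. This is only an illustration under assumptions: `LoaderKind` and `pickLoaderKind` are hypothetical names, and the real AvatarModel.tsx may resolve the format differently.

```typescript
// Hypothetical helper: maps a model path to one of the three loader shells.
type LoaderKind = "vrm" | "glb" | "fbx";

const LOADER_BY_EXTENSION: Record<string, LoaderKind> = {
  vrm: "vrm", // GLTFLoader + VRMLoaderPlugin shell
  glb: "glb", // plain GLTFLoader shell
  fbx: "fbx", // FBXLoader shell
};

function pickLoaderKind(modelPath: string): LoaderKind {
  // Drop any query string or hash, then take the text after the last dot.
  const clean = modelPath.split(/[?#]/)[0] ?? "";
  const ext = clean.slice(clean.lastIndexOf(".") + 1).toLowerCase();
  const kind = LOADER_BY_EXTENSION[ext];
  if (kind === undefined) {
    throw new Error(`Unsupported avatar format: .${ext}`);
  }
  return kind;
}
```

A component could switch on the returned kind to render the matching shell, which keeps the three shared hooks unaware of where the asset came from.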
02 Lip sync: amplitude, not phonemes.
The avatar doesn't try to read phoneme timings out of TTS. Instead, the audio output element is piped through an AnalyserNode (fftSize: 1024, smoothing: 0.72), and every frame useLipSync:
- Pulls frequencyBinCount bytes via getByteFrequencyData
- Sums energy across roughly the 2%–17% bin range (voice-band, skipping rumble and the high hiss)
- Takes the RMS, calls anything above 0.02 "speaking"
- Lerps jawOpen toward amp * 0.8 with a smoothing factor of 0.3 per frame
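The per-frame steps above can be sketched as two pure functions. This is a rough approximation, not the actual useLipSync code: the function names are hypothetical, the byte values are assumed to be normalized by 255, and the jaw is assumed to relax toward 0 when the amplitude is below the speaking threshold. In the browser, `bins` would be the array filled by `analyser.getByteFrequencyData(...)`.

```typescript
// Constants taken from the description above.
const SPEAKING_THRESHOLD = 0.02;
const JAW_GAIN = 0.8;
const JAW_SMOOTHING = 0.3;

/** RMS of the voice-band bins (about 2%-17% of the spectrum), normalized to 0..1. */
function voiceBandRms(bins: Uint8Array): number {
  const start = Math.floor(bins.length * 0.02);
  const end = Math.max(start + 1, Math.floor(bins.length * 0.17));
  let sumSquares = 0;
  let count = 0;
  for (let i = start; i < end && i < bins.length; i++) {
    const v = (bins[i] ?? 0) / 255; // byte magnitude scaled to 0..1
    sumSquares += v * v;
    count++;
  }
  return count > 0 ? Math.sqrt(sumSquares / count) : 0;
}

/** One frame of jaw smoothing: move jawOpen 30% of the way toward its target. */
function stepJawOpen(jawOpen: number, amp: number): number {
  const speaking = amp > SPEAKING_THRESHOLD;
  const target = speaking ? amp * JAW_GAIN : 0; // assumption: jaw closes when silent
  return jawOpen + (target - jawOpen) * JAW_SMOOTHING;
}
```

With fftSize 1024 the analyser yields 512 bins, so the band covers roughly bins 10 through 86 in this sketch.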
That jawOpen value drives whichever mouth-open channel the model exposes — a VRM Aa expression, a morph target keyed by name discovery (mouth-open, jaw-open, aa, a), or a GLB shared-morph that gets written to every child mesh that has the matching name. Blink is procedural too: a sine pulse on (t % 4.3) < 0.14 drives VRMExpressionPresetName.Blink or the discovered blink morph.
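A minimal sketch of the two ideas in this paragraph, name-based mouth-morph discovery and the procedural blink, might look like the following. The names are hypothetical, the half-sine pulse shape is an assumption (the text only says "a sine pulse"), and the name normalization is a guess at what the discovery heuristics tolerate. The `dictionary` argument mirrors the shape of a three.js mesh's `morphTargetDictionary` (morph name to index).

```typescript
const BLINK_PERIOD_S = 4.3;
const BLINK_DURATION_S = 0.14;

/** Blink weight in 0..1: a short half-sine pulse at the start of every 4.3 s cycle. */
function blinkWeight(t: number): number {
  const phase = t % BLINK_PERIOD_S;
  if (phase >= BLINK_DURATION_S) return 0;
  return Math.sin((phase / BLINK_DURATION_S) * Math.PI);
}

// Candidate names from the text, in priority order.
const MOUTH_CANDIDATES = ["mouth-open", "jaw-open", "aa", "a"];

/** Returns the morph index of the first candidate present, or undefined. */
function findMouthMorphIndex(dictionary: Record<string, number>): number | undefined {
  const normalized = new Map<string, number>();
  for (const [morphName, index] of Object.entries(dictionary)) {
    // Assumption: case and underscore/space differences are ignored.
    normalized.set(morphName.toLowerCase().replace(/[_\s]+/g, "-"), index);
  }
  for (const candidate of MOUTH_CANDIDATES) {
    const index = normalized.get(candidate);
    if (index !== undefined) return index;
  }
  return undefined;
}
```

Exact matching (rather than substring matching) matters here because a candidate as short as "a" would otherwise match almost every morph name.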
Phoneme-driven lip sync would mean aligning with whatever TTS engine is in use. Moshi exposes word timings but not phonemes. Azure exposes visemes but only for some voices. Amplitude doesn't care which engine speaks, and the result is visually convincing because human eyes mostly read mouth-open-vs-mouth-shut from across a room, not specific viseme shapes.
03 Speaking motion on top of everything else.
When the avatar is speaking, the frame loop layers per-bone perturbations on top of any active animation clip:
- neck.rotation.x += sin(t * 4.2) * amp * 0.025 (head nod synchronized to amplitude)
- head.rotation.x += sin(t * 5.5) * amp * 0.08 + sin(t * 3.1) * amp * 0.04 (head tilt blend)
- head.rotation.y += sin(t * 2.3) * amp * 0.03 (small yaw drift)
- head.rotation.z += sin(t * 4.7) * amp * 0.02 (head roll)
Idle motion (when no clip is playing and the avatar isn't speaking) adds a breath cycle on the spine and head, plus a slow drift on the neck. All of this composes — a Mixamo "head nod yes" clip plays through the mixer, the speaking perturbations layer on top with premultiply, and the result is animation that doesn't look canned.
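The speaking perturbations can be written as a pure function of time and amplitude. This sketch uses the coefficients listed above; the type and function names are hypothetical, and it assumes the offsets are added after the animation mixer has written the clip pose for the frame, which is what lets them layer on top of it.

```typescript
interface EulerDelta {
  x: number;
  y: number;
  z: number;
}

/** Rotation offsets (radians) for the neck and head at time t and amplitude amp. */
function speakingOffsets(t: number, amp: number): { neck: EulerDelta; head: EulerDelta } {
  return {
    neck: { x: Math.sin(t * 4.2) * amp * 0.025, y: 0, z: 0 },
    head: {
      x: Math.sin(t * 5.5) * amp * 0.08 + Math.sin(t * 3.1) * amp * 0.04,
      y: Math.sin(t * 2.3) * amp * 0.03,
      z: Math.sin(t * 4.7) * amp * 0.02,
    },
  };
}

/** Adds a delta onto a bone rotation in place, leaving the clip's pose as the base. */
function addOffset(rotation: EulerDelta, delta: EulerDelta): void {
  rotation.x += delta.x;
  rotation.y += delta.y;
  rotation.z += delta.z;
}
```

Because every term is scaled by `amp`, the offsets vanish when the avatar is silent, and the incommensurate frequencies keep the motion from repeating on a short, noticeable cycle.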
04 Mixamo to VRM: the retargeting problem.
Animations come from Mixamo. Avatars come from VRoid, Booth, Sketchfab, and custom Blender exports. The bones don't match. Mixamo names its bones mixamorig:Head, mixamorig:LeftArm, etc.; VRM normalizes to head, leftUpperArm, etc.; a custom GLB might use anything at all.
retarget.ts handles three problems at once:
- Name mapping. MIXAMO_TO_VRM_BONE in lib/boneMapping.ts is the canonical map. retargetAnimationClip walks every track in a clip, strips mixamorig:, looks up the target bone name, and drops tracks with no mapping (e.g. HeadTop_End, finger end-bones).
- Rest pose correction. A T-pose source clip applied to an A-pose target will bend the wrist backward. The retargeter captures rest-pose quaternions for both source and target before any animation runs, then composes q_target = sourceRest⁻¹ × q_source × targetRest on each track so the animation deforms from the target's rest pose.
- Axis convention. VRM normalized bones use Z-axis-forward; Mixamo FBX bones use X-axis-forward. The arm-space offset in useLipSync picks the right axis based on modelType and mirrors the right arm against the left.
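The first two steps, name mapping and rest-pose correction, can be sketched without any 3D library. Everything here is illustrative: `SAMPLE_BONE_MAP` is a two-entry excerpt based on the examples in the text rather than the real MIXAMO_TO_VRM_BONE table, track names are assumed to follow the `bone.property` convention, and quaternions are assumed to be unit length so the inverse is the conjugate.

```typescript
interface Quat { x: number; y: number; z: number; w: number }

// Hypothetical excerpt of a Mixamo-to-VRM bone map.
const SAMPLE_BONE_MAP: Record<string, string> = { Head: "head", LeftArm: "leftUpperArm" };

/** Renames "mixamorig:Bone.prop" to "targetBone.prop"; null means drop the track. */
function remapTrackName(trackName: string, boneMap: Record<string, string>): string | null {
  const dot = trackName.lastIndexOf(".");
  if (dot < 0) return null;
  const bone = trackName.slice(0, dot).replace(/^mixamorig:/, "");
  const target = boneMap[bone];
  return target === undefined ? null : `${target}.${trackName.slice(dot + 1)}`;
}

/** Hamilton product a × b. */
function quatMul(a: Quat, b: Quat): Quat {
  return {
    x: a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
    y: a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
    z: a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w,
    w: a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
  };
}

/** Inverse of a unit quaternion (its conjugate). */
function quatInverse(q: Quat): Quat {
  return { x: -q.x, y: -q.y, z: -q.z, w: q.w };
}

/** q_target = sourceRest⁻¹ × q_source × targetRest, as described above. */
function correctRestPose(qSource: Quat, sourceRest: Quat, targetRest: Quat): Quat {
  return quatMul(quatMul(quatInverse(sourceRest), qSource), targetRest);
}
```

A useful sanity check on the formula: when a keyframe equals the source rest pose, the first product is the identity, so the target bone stays exactly at its own rest pose.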
05 Persisted per-model overrides.
Every model can have a stored override blob keyed by model path. The Prisma AvatarSettings table holds:
| Field | What it overrides |
|---|---|
| meshVisibility | Per-mesh on/off — hide a hat, kill an oversized hair fan, disable a particle emitter |
| morphOverrides | Per-morph fixed values keyed by meshName::morphName — set a smile to 0.3, keep eyes half-closed, override lip-sync output |
| materialOverrides.materialVisibility | Per-material on/off keyed by meshName::materialName |
| materialOverrides.colorTints | Per-material hex tint overrides |
| materialOverrides.shininess | Global shininess for non-PBR materials |
| armSpaceOffset | Degrees of arm-down rotation to apply on top of rest pose — fixes T-pose-vs-A-pose mismatches |
| sceneSettings | Background, fog, post-processing toggles (free-form JSON) |
An override on a morph target wins over lip sync — if you've manually set mouth_open for a particular model (because, say, the auto-discovered morph is the wrong one and lip sync looks weird), useLipSync notices the key is in morphOverrides and skips it. The same applies to materials and meshes.
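The precedence rule can be sketched as follows. The field names come from the table above, but the TypeScript types are assumptions about the blob's shape, and `resolveMorphValue` is a hypothetical helper rather than the hook's real internals.

```typescript
// Assumed shape of the stored override blob; only field names are from the docs.
interface AvatarOverrides {
  meshVisibility?: Record<string, boolean>;
  morphOverrides?: Record<string, number>; // keyed by "meshName::morphName"
  materialOverrides?: {
    materialVisibility?: Record<string, boolean>; // keyed by "meshName::materialName"
    colorTints?: Record<string, string>;
    shininess?: number;
  };
  armSpaceOffset?: number; // degrees
  sceneSettings?: Record<string, unknown>;
}

function morphOverrideKey(meshName: string, morphName: string): string {
  return `${meshName}::${morphName}`;
}

/** A pinned override wins; otherwise the lip-sync value passes through. */
function resolveMorphValue(
  overrides: AvatarOverrides,
  meshName: string,
  morphName: string,
  lipSyncValue: number,
): number {
  const pinned = overrides.morphOverrides?.[morphOverrideKey(meshName, morphName)];
  return pinned !== undefined ? pinned : lipSyncValue;
}
```

Checking for `undefined` rather than truthiness matters: an override of 0 (mouth pinned shut) is a legitimate value and should still win.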
06 The bone mapping editor.
Auto-mapping works for most models — autoMapBonesToMap uses name-pattern heuristics to match a model's actual bone names to the Mixamo canonical names. But it sometimes fails — a custom rig that names its head bone BONE_NULL_HEAD isn't going to auto-match.
BoneMappingAdmin is the human escape hatch. It groups bones by body part (Core, Left Arm, Right Arm, Left Leg, Right Leg, Left Hand, Right Hand) and lets you pick from the model's bone list for each canonical Mixamo slot. The result is persisted to the BoneMapping Prisma table and overrides auto-detection on subsequent loads. The page also runs animation triggers — load a clip, see how it lands, tweak the mapping, save.
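How auto-detection and the saved mapping combine can be sketched like this. The alias table and function names are hypothetical and far smaller than whatever heuristics autoMapBonesToMap actually uses; the point is only the merge order, where a persisted manual choice overrides the automatic guess.

```typescript
// Hypothetical alias table: canonical Mixamo slot -> normalized names it may appear as.
const SLOT_ALIASES: Record<string, string[]> = {
  Head: ["head"],
  Hips: ["hips", "pelvis"],
  LeftArm: ["leftarm", "leftupperarm"],
};

function normalizeBoneName(boneName: string): string {
  return boneName.toLowerCase().replace(/[^a-z0-9]/g, "");
}

/** Name-pattern guess: slot -> the model's bone name, for slots that match. */
function autoMapSketch(boneNames: string[]): Record<string, string> {
  const mapping: Record<string, string> = {};
  for (const [slot, aliases] of Object.entries(SLOT_ALIASES)) {
    const match = boneNames.find((b) => aliases.includes(normalizeBoneName(b)));
    if (match !== undefined) mapping[slot] = match;
  }
  return mapping;
}

/** Saved manual entries override auto-detected ones, slot by slot. */
function resolveBoneMap(
  boneNames: string[],
  saved: Record<string, string>,
): Record<string, string> {
  return { ...autoMapSketch(boneNames), ...saved };
}
```

In this sketch a bone named BONE_NULL_HEAD normalizes to "bonenullhead", matches no alias, and stays unmapped until a saved entry supplies it.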
07 Cameras and presets.
The viewer auto-computes camera presets from each model's bounding box on load — Full Body, Mid, Close-Up, Face. CameraRig handles smooth transitions between presets with damped position + target interpolation. Camera presets are persisted per-model so the next time you load the same VRM, the camera comes up where you left it. The orbit controls remain available for live exploration.
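A rough sketch of bounding-box presets and damped transitions follows. The framing fractions, the 35 degree field of view, and the assumption that the model faces +Z are all invented for illustration; only the four preset names come from the text. The damping step uses the standard frame-rate-independent exponential form, which is what three.js exposes as `MathUtils.damp`.

```typescript
interface Vec3 { x: number; y: number; z: number }
interface Bounds { min: Vec3; max: Vec3 }
interface CameraPreset { position: Vec3; target: Vec3 }

// Illustrative values: `focus` is the height fraction to look at,
// `span` is the fraction of model height that should fit in view.
const FRAMING: Record<string, { focus: number; span: number }> = {
  "Full Body": { focus: 0.5, span: 1.1 },
  "Mid": { focus: 0.7, span: 0.6 },
  "Close-Up": { focus: 0.85, span: 0.35 },
  "Face": { focus: 0.92, span: 0.2 },
};

function computePresets(bounds: Bounds, fovDegrees = 35): Record<string, CameraPreset> {
  const height = bounds.max.y - bounds.min.y;
  const centerX = (bounds.min.x + bounds.max.x) / 2;
  const centerZ = (bounds.min.z + bounds.max.z) / 2;
  const halfFov = (fovDegrees * Math.PI) / 360;
  const presets: Record<string, CameraPreset> = {};
  for (const [label, { focus, span }] of Object.entries(FRAMING)) {
    const targetY = bounds.min.y + height * focus;
    // A perspective camera sees 2 * d * tan(fov / 2) vertically at distance d.
    const distance = (height * span) / 2 / Math.tan(halfFov);
    presets[label] = {
      target: { x: centerX, y: targetY, z: centerZ },
      position: { x: centerX, y: targetY, z: bounds.max.z + distance }, // assumes model faces +Z
    };
  }
  return presets;
}

/** Moves current toward target; larger lambda means a snappier transition. */
function dampToward(current: number, target: number, lambda: number, dt: number): number {
  return current + (target - current) * (1 - Math.exp(-lambda * dt));
}
```

Applying `dampToward` per component to both the camera position and its look-at target each frame gives a transition that eases out regardless of frame rate.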
08 The animation library.
Animation clips are FBX files in /models/animations/:
"Breathing Idle" is the auto-play default. Clips above it are reaction animations — the LLM can name one as part of its response (parsed by the voice pipeline into a BawtHubServiceAnimation event) and useAvatarStore.triggerAnimation queues it. The avatar viewer consumes the pending animation, plays it once, then returns to idle.
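The trigger, consume, play-once, return-to-idle cycle can be sketched as a tiny state holder. This is an assumption-heavy stand-in for useAvatarStore: the class name is hypothetical, and it models a single pending slot in which a newer trigger replaces an unconsumed one, which may not match how the real store queues.

```typescript
const IDLE_CLIP = "Breathing Idle";

class AnimationSlotSketch {
  private pending: string | null = null;
  private current: string = IDLE_CLIP;

  /** Called when the voice pipeline reports an animation cue. */
  triggerAnimation(clipName: string): void {
    this.pending = clipName;
  }

  /** Called by the viewer; returns the clip to play once, or null if none is waiting. */
  consumePending(): string | null {
    const next = this.pending;
    this.pending = null;
    if (next !== null) this.current = next;
    return next;
  }

  /** Called when the one-shot clip finishes; playback falls back to idle. */
  onClipComplete(): void {
    this.current = IDLE_CLIP;
  }

  currentClip(): string {
    return this.current;
  }
}
```

Clearing the pending slot on consume is what keeps a reaction from replaying on every render after it has been picked up.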
An AnimationTriggersAdmin page lets you wire animation cues to LLM keywords, set fallback animations per emotion class, and preview each clip on the current model with retargeting applied.
09 How the avatar finds its audio.
The avatar is a separate R3F canvas, but it doesn't run its own audio context — it borrows the one from BawtHub.tsx. When the voice call's useRealtimeAudioOutput creates the AudioContext + AudioWorkletNode + AnalyserNode, the analyser ref is exposed to the avatar via component props. The avatar's lip sync hook reads from that same analyser — meaning the mouth opens at the same moment the speakers do, with no extra audio routing overhead.
On the /avatar page (where there's no active voice call), the model still loads and animates — idle motion, breath, blink — but isSpeaking stays false and the analyser is null, so the lip sync hook just runs the procedural-only branch. The mode flips back the instant a voice call starts.
10 Key files.
- frontend/src/app/avatar/AvatarModel.tsx
- frontend/src/app/avatar/useLipSync.ts
- frontend/src/app/avatar/useModelSetup.ts
- frontend/src/app/avatar/useAnimations.ts: … onClipComplete when the clip finishes.
- frontend/src/app/avatar/retarget.ts: … sourceRest⁻¹ × q × targetRest), track dropping for unmapped end-bones.
- frontend/src/app/avatar/CameraRig.tsx · SceneLighting.tsx · SceneExposer.tsx
- frontend/src/app/avatar/persistence.ts: … /api/avatar/settings.
- frontend/src/app/avatar/utils.ts: findBone, findMorph, findMorphByKeyword, discoverModel: the heuristics that auto-map mouth and blink targets.
- frontend/src/lib/boneMapping.ts: STANDARD_BONE_GROUPS for the admin UI, autoMapBonesToMap for first-load auto-detection.
- frontend/src/app/BoneMappingAdmin.tsx · AnimationTriggersAdmin.tsx
- frontend/src/app/AvatarViewer.tsx: … /avatar and as an inline embed in the voice page.
- frontend/src/app/avatar/proceduralPose.ts · poseLibrary.ts

main on 2026-05-13
Source: bawthub repo (private)