BawtHub · avatar

A face for every bawt.

The 3D avatar surface is BawtHub's most experimental feature — three model formats, two skeleton conventions, one shared scene, lip sync driven by audio amplitude instead of phonemes, and a bone-retargeting layer so a Mixamo dance clip lands cleanly on a VRM rig from a different vendor. It's labeled "experimental" in the README, but it's been load-bearing in the voice surface for over a year.

Renderer: three.js 0.183 · @react-three/fiber 9 · @react-three/drei 10
VRM: @pixiv/three-vrm 3.5
Lip sync: AnalyserNode FFT · 60 fps morph drive

01 Three formats, one component.

AvatarModel.tsx dispatches to a per-format loader shell based on file extension:

Format	Loader	Origin	Lip-sync path
`.vrm`	`GLTFLoader + VRMLoaderPlugin`	VRoid, ChatVRM exports, Booth marketplace	VRM expression `Aa` if present, morph fallback otherwise
`.glb`	`GLTFLoader`	Sketchfab, Mixamo + custom rigs	Shared morph names across child meshes
`.fbx`	`FBXLoader`	Mixamo characters	Single-mesh morph target

Each shell loads its asset with useLoader, then funnels into three shared hooks: useModelSetup (mesh/morph/material/bone discovery and overrides), useAnimations (FBX clip loading + retarget), and useLipSync (the 60 fps frame loop that drives mouth, blink, breath, and head sway). The three loaders are different by necessity — VRM has a normalized humanoid skeleton, FBX has Mixamo's nonstandard naming, GLB is a wild west — but the hooks all collapse onto the same interface.
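As a rough illustration of the extension-based dispatch, the sketch below maps a model path to one of the three formats and the loader named in the table. The names `ModelFormat`, `formatFromPath`, and `LOADER_FOR` are hypothetical and not taken from AvatarModel.tsx; the real component may handle paths differently.

```typescript
// Hypothetical helper names; only the three extensions and loader names come from the table above.
type ModelFormat = "vrm" | "glb" | "fbx";

const LOADER_FOR: Record<ModelFormat, string> = {
  vrm: "GLTFLoader + VRMLoaderPlugin",
  glb: "GLTFLoader",
  fbx: "FBXLoader",
};

function formatFromPath(path: string): ModelFormat | null {
  // Drop any query string or fragment before reading the extension.
  const clean = path.split(/[?#]/)[0];
  const dot = clean.lastIndexOf(".");
  if (dot < 0) return null;
  switch (clean.slice(dot + 1).toLowerCase()) {
    case "vrm":
      return "vrm";
    case "glb":
      return "glb";
    case "fbx":
      return "fbx";
    default:
      return null; // unsupported extension: the caller decides how to report it
  }
}
```

A caller would branch on the returned format to pick a loader shell, for example `LOADER_FOR[formatFromPath("/models/a.vrm")!]`.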

02 Lip sync: amplitude, not phonemes.

The avatar doesn't try to read phoneme timings out of TTS. Instead, the audio output element is piped through an AnalyserNode (fftSize: 1024, smoothingTimeConstant: 0.72), and on every frame useLipSync:

Pulls frequencyBinCount bytes via getByteFrequencyData
Sums energy across roughly the 2%–17% bin range (voice-band, skipping rumble and the high hiss)
Takes the RMS, calls anything above 0.02 "speaking"
Lerps jawOpen toward amp * 0.8 with a smoothing factor of 0.3 per frame

That jawOpen value drives whichever mouth-open channel the model exposes — a VRM Aa expression, a morph target keyed by name discovery (mouth-open, jaw-open, aa, a), or a GLB shared-morph that gets written to every child mesh that has the matching name. Blink is procedural too: a sine pulse on (t % 4.3) < 0.14 drives VRMExpressionPresetName.Blink or the discovered blink morph.
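The per-frame math above can be sketched as three small pure functions. This is a minimal sketch, not the code in useLipSync.ts: the function names are hypothetical, the input is assumed to be the byte array that getByteFrequencyData fills, the jaw target is assumed to fall to 0 below the speaking threshold, and the half-sine blink shape is one plausible reading of "sine pulse".

```typescript
const VOICE_BAND_START = 0.02; // fraction of the bin range, from the text
const VOICE_BAND_END = 0.17;
const SPEAKING_THRESHOLD = 0.02;

// RMS of the voice-band bins, normalized to 0..1. `bins` holds bytes (0..255),
// e.g. 512 of them for fftSize 1024.
function voiceBandRms(bins: Uint8Array): number {
  const start = Math.floor(bins.length * VOICE_BAND_START);
  const end = Math.min(
    bins.length,
    Math.max(start + 1, Math.floor(bins.length * VOICE_BAND_END)),
  );
  let sumSquares = 0;
  for (let i = start; i < end; i++) {
    const v = bins[i] / 255;
    sumSquares += v * v;
  }
  const count = end - start;
  return count > 0 ? Math.sqrt(sumSquares / count) : 0;
}

// One frame of jaw smoothing: move 30% of the way toward the target.
// Assumption: the target is 0 when the amplitude is below the threshold.
function stepJaw(jawOpen: number, amp: number): number {
  const target = amp > SPEAKING_THRESHOLD ? amp * 0.8 : 0;
  return jawOpen + (target - jawOpen) * 0.3;
}

// Blink weight for time t in seconds: a 0.14 s pulse every 4.3 s.
// Assumption: the pulse is a half sine that rises to 1 and returns to 0.
function blinkWeight(t: number): number {
  const phase = t % 4.3;
  return phase < 0.14 ? Math.sin((phase / 0.14) * Math.PI) : 0;
}
```

Keeping these as pure functions of the byte array and the clock makes the frame loop easy to exercise without an audio device.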

✦

Why not phoneme-driven viseme animation?

Phoneme-driven visemes need timing data from whatever TTS engine is in use, and engines differ: Moshi exposes word timings but not phonemes, and Azure exposes visemes but only for some voices. Amplitude doesn't care which engine speaks, and the result is visually convincing because, from across a room, human eyes mostly read mouth-open versus mouth-shut rather than specific viseme shapes.

03 Speaking motion on top of everything else.

When the avatar is speaking, the frame loop layers per-bone perturbations on top of any active animation clip:

neck.rotation.x += sin(t * 4.2) * amp * 0.025 — head nod synchronized to amplitude
head.rotation.x += sin(t * 5.5) * amp * 0.08 + sin(t * 3.1) * amp * 0.04 — head tilt blend
head.rotation.y += sin(t * 2.3) * amp * 0.03 — small yaw drift
head.rotation.z += sin(t * 4.7) * amp * 0.02 — head roll

Idle motion (when no clip is playing and the avatar isn't speaking) adds a breath cycle on the spine and head, plus a slow drift on the neck. All of this composes — a Mixamo "head nod yes" clip plays through the mixer, the speaking perturbations layer on top with premultiply, and the result is animation that doesn't look canned.
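The speaking perturbations listed above reduce to a pure function of time and amplitude. The sketch below only computes the per-frame offsets; the names `Euler3` and `speakingOffsets` are hypothetical, and how the offsets are composed with the mixer's output is left out because the text does not spell it out.

```typescript
interface Euler3 {
  x: number;
  y: number;
  z: number;
}

// Rotation offsets in radians to add on top of the current pose.
// Coefficients are copied from the list above; t is seconds, amp is 0..1.
function speakingOffsets(t: number, amp: number): { neck: Euler3; head: Euler3 } {
  return {
    neck: { x: Math.sin(t * 4.2) * amp * 0.025, y: 0, z: 0 },
    head: {
      x: Math.sin(t * 5.5) * amp * 0.08 + Math.sin(t * 3.1) * amp * 0.04,
      y: Math.sin(t * 2.3) * amp * 0.03,
      z: Math.sin(t * 4.7) * amp * 0.02,
    },
  };
}
```

Because every term is scaled by `amp`, the offsets vanish when the avatar is silent, and the largest head pitch offset is bounded by 0.12 radians at full amplitude.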

04 Mixamo to VRM: the retargeting problem.

Animations come from Mixamo. Avatars come from VRoid, Booth, Sketchfab, and custom Blender exports. The bones don't match. Mixamo names its bones mixamorig:Head, mixamorig:LeftArm, etc.; VRM normalizes to head, leftUpperArm, etc.; a custom GLB might use anything at all.

retarget.ts handles three problems at once:

Name mapping. MIXAMO_TO_VRM_BONE in lib/boneMapping.ts is the canonical map. retargetAnimationClip walks every track in a clip, strips mixamorig:, looks up the target bone name, and drops tracks with no mapping (e.g. HeadTop_End, finger end-bones).
Rest pose correction. Applied unchanged, a clip authored against a T-pose rest pose bends an A-pose target the wrong way; a wrist, for example, ends up bent backward. The retargeter captures rest-pose quaternions for both source and target before any animation runs, then composes q_target = sourceRest⁻¹ × q_source × targetRest on each track so the animation deforms from the target's rest pose.
Axis convention. VRM normalized bones use Z-axis-forward; Mixamo FBX bones use X-axis-forward. The arm-space offset in useLipSync picks the correct axis based on modelType and mirrors the right arm against the left.
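The first two steps can be sketched without three.js. This is an assumption-laden illustration, not the contents of retarget.ts: the map is a two-entry subset using only the names quoted above, track names are assumed to follow the three.js `<node>.<property>` convention, the quaternion helpers are hand-rolled, and rest-pose quaternions are assumed to be unit length.

```typescript
// Illustrative subset; the canonical map is MIXAMO_TO_VRM_BONE in lib/boneMapping.ts.
const MIXAMO_TO_VRM_SUBSET: Record<string, string> = {
  Head: "head",
  LeftArm: "leftUpperArm",
};

// Returns the remapped track name, or null when the track should be dropped.
function remapTrackName(trackName: string): string | null {
  const dot = trackName.lastIndexOf(".");
  if (dot < 0) return null;
  const bone = trackName.slice(0, dot).replace(/^mixamorig:/, "");
  const property = trackName.slice(dot); // keeps the leading "."
  const mapped = Object.prototype.hasOwnProperty.call(MIXAMO_TO_VRM_SUBSET, bone);
  return mapped ? MIXAMO_TO_VRM_SUBSET[bone] + property : null;
}

interface Quat {
  x: number;
  y: number;
  z: number;
  w: number;
}

// Hamilton product a × b.
function multiply(a: Quat, b: Quat): Quat {
  return {
    x: a.w * b.x + a.x * b.w + a.y * b.z - a.z * b.y,
    y: a.w * b.y - a.x * b.z + a.y * b.w + a.z * b.x,
    z: a.w * b.z + a.x * b.y - a.y * b.x + a.z * b.w,
    w: a.w * b.w - a.x * b.x - a.y * b.y - a.z * b.z,
  };
}

// Inverse of a unit quaternion is its conjugate.
function invertUnit(q: Quat): Quat {
  return { x: -q.x, y: -q.y, z: -q.z, w: q.w };
}

// q_target = sourceRest⁻¹ × q_source × targetRest
function retargetQuat(qSource: Quat, sourceRest: Quat, targetRest: Quat): Quat {
  return multiply(multiply(invertUnit(sourceRest), qSource), targetRest);
}
```

A useful sanity check on the formula: when the source bone sits exactly at its rest pose, the result is the target's rest pose, so an idle source leaves the target undeformed.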

05 Persisted per-model overrides.

Every model can have a stored override blob keyed by model path. The Prisma AvatarSettings table holds:

Field	What it overrides
`meshVisibility`	Per-mesh on/off — hide a hat, kill an oversized hair fan, disable a particle emitter
`morphOverrides`	Per-morph fixed values keyed by `meshName::morphName` — set a smile to 0.3, keep eyes half-closed, override lip-sync output
`materialOverrides.materialVisibility`	Per-material on/off keyed by `meshName::materialName`
`materialOverrides.colorTints`	Per-material hex tint overrides
`materialOverrides.shininess`	Global shininess for non-PBR materials
`armSpaceOffset`	Degrees of arm-down rotation to apply on top of rest pose — fixes T-pose-vs-A-pose mismatches
`sceneSettings`	Background, fog, post-processing toggles (free-form JSON)

An override on a morph target wins over lip sync — if you've manually set mouth_open for a particular model (because, say, the auto-discovered morph is the wrong one and lip sync looks weird), useLipSync notices the key is in morphOverrides and skips it. The same applies to materials and meshes.
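The precedence rule for morphs can be sketched as a single lookup. The `meshName::morphName` key format comes from the table above; the function names and the exact shape of `morphOverrides` (a flat string-to-number record) are assumptions.

```typescript
type MorphOverrides = Record<string, number>;

function morphKey(meshName: string, morphName: string): string {
  return `${meshName}::${morphName}`;
}

// A stored override always wins; otherwise the lip-sync value passes through.
function resolveMorphValue(
  overrides: MorphOverrides,
  meshName: string,
  morphName: string,
  lipSyncValue: number,
): number {
  const key = morphKey(meshName, morphName);
  return Object.prototype.hasOwnProperty.call(overrides, key)
    ? overrides[key]
    : lipSyncValue;
}
```

Checking for the key's presence rather than a truthy value matters here: an override of 0 is a legitimate way to pin a mouth shut.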

06 The bone mapping editor.

Auto-mapping works for most models — autoMapBonesToMap uses name-pattern heuristics to match a model's actual bone names to the Mixamo canonical names. But it sometimes fails — a custom rig that names its head bone BONE_NULL_HEAD isn't going to auto-match.

BoneMappingAdmin is the human escape hatch. It groups bones by body part (Core, Left Arm, Right Arm, Left Leg, Right Leg, Left Hand, Right Hand) and lets you pick from the model's bone list for each canonical Mixamo slot. The result is persisted to the BoneMapping Prisma table and overrides auto-detection on subsequent loads. The page also runs animation triggers — load a clip, see how it lands, tweak the mapping, save.
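To make the "auto-detect, then let a saved mapping win" flow concrete, here is a minimal sketch. It is not autoMapBonesToMap: the slot list is a short illustrative subset of Mixamo names, the heuristic is a deliberately strict exact match after normalization (an assumption, chosen so a name like BONE_NULL_HEAD fails as described), and all function names are hypothetical.

```typescript
// Illustrative subset of canonical Mixamo slots.
const CANONICAL_SLOTS = ["Hips", "Spine", "Neck", "Head", "LeftArm", "RightArm"];

// Lowercase, drop a leading "mixamorig" prefix, and strip separators.
function normalizeBoneName(name: string): string {
  return name
    .replace(/^mixamorig:?/i, "")
    .replace(/[^a-z0-9]/gi, "")
    .toLowerCase();
}

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Slot -> model bone name, for every slot the heuristic can match.
function autoMap(modelBones: string[]): Record<string, string> {
  const byNormalized: Record<string, string> = {};
  for (const bone of modelBones) {
    const key = normalizeBoneName(bone);
    if (!hasOwn(byNormalized, key)) byNormalized[key] = bone;
  }
  const result: Record<string, string> = {};
  for (const slot of CANONICAL_SLOTS) {
    const key = slot.toLowerCase();
    if (hasOwn(byNormalized, key)) result[slot] = byNormalized[key];
  }
  return result;
}

// Saved (human-edited) entries override auto-detected ones.
function resolveMapping(
  modelBones: string[],
  saved: Record<string, string>,
): Record<string, string> {
  return { ...autoMap(modelBones), ...saved };
}
```

With this shape, a rig the heuristic cannot read still works once a person fills in the missing slots, and those choices survive later loads because the saved record is merged last.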

07 Cameras and presets.

The viewer auto-computes camera presets from each model's bounding box on load — Full Body, Mid, Close-Up, Face. CameraRig handles smooth transitions between presets with damped position + target interpolation. Camera presets are persisted per-model so the next time you load the same VRM, the camera comes up where you left it. The orbit controls remain available for live exploration.
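One way to derive presets from a bounding box, and to damp the transition between them, is sketched below. Everything numeric here is an assumption: the framing fractions are invented for illustration, only the vertical extent of the box is used, and `damp` is the common frame-rate-independent exponential form rather than whatever CameraRig actually does.

```typescript
interface Bounds {
  minY: number;
  maxY: number;
}

interface CameraPreset {
  name: string;
  targetY: number; // height the camera looks at
  distance: number; // distance needed to fit the framed span vertically
}

// Hypothetical fractions of the model's height shown by each preset, measured from the top.
const PRESET_FRACTIONS: Array<[string, number]> = [
  ["Full Body", 1.0],
  ["Mid", 0.6],
  ["Close-Up", 0.35],
  ["Face", 0.15],
];

function computePresets(bounds: Bounds, fovDegrees: number): CameraPreset[] {
  const height = bounds.maxY - bounds.minY;
  const halfFov = (fovDegrees * Math.PI) / 360; // half the vertical FOV, in radians
  return PRESET_FRACTIONS.map(([name, fraction]) => {
    const framed = height * fraction;
    return {
      name,
      targetY: bounds.maxY - framed / 2,
      distance: framed / 2 / Math.tan(halfFov),
    };
  });
}

// Exponential damping toward a target; dt is the frame time in seconds.
function damp(current: number, target: number, lambda: number, dt: number): number {
  return current + (target - current) * (1 - Math.exp(-lambda * dt));
}
```

Calling `damp` once per frame on each position and target component gives a transition that eases out and behaves the same at 30 fps and 60 fps.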

08 The animation library.

Animation clips are FBX files in /models/animations/:

Bundled clips · all Mixamo, loaded via FBXLoader, retargeted at runtime

Breathing Idle · Acknowledging · Angry Gesture · Annoyed Head Shake · Being Cocky · Dismissing Gesture · Happy Hand Gesture · Hard Head Nod · Head Nod Yes · Lengthy Head Nod · Look Away Gesture · Relieved Sigh · Sarcastic Head Nod · Shaking Head No · Thoughtful Head Shake · Weight Shift

"Breathing Idle" is the auto-play default. The other clips are reaction animations: the LLM can name one as part of its response (parsed by the voice pipeline into a BawtHubServiceAnimation event) and useAvatarStore.triggerAnimation queues it. The avatar viewer consumes the pending animation, plays it once, then returns to idle.

An AnimationTriggersAdmin page lets you wire animation cues to LLM keywords, set fallback animations per emotion class, and preview each clip on the current model with retargeting applied.
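The trigger-then-consume flow can be sketched as a small queue. This is not useAvatarStore: the class name is hypothetical, a first-in-first-out list is assumed (the real store may keep only a single pending slot), and ignoring requests for the idle clip is an illustrative choice.

```typescript
const IDLE_CLIP = "Breathing Idle";

class AnimationQueue {
  private pending: string[] = [];

  // Called when the voice pipeline reports an animation cue.
  trigger(name: string): void {
    if (name !== IDLE_CLIP) this.pending.push(name);
  }

  // Called by the viewer when the current clip finishes:
  // the next queued reaction if there is one, otherwise back to idle.
  next(): string {
    return this.pending.shift() ?? IDLE_CLIP;
  }
}
```

A viewer would call `next()` from its clip-complete handler, so each reaction plays once and the avatar falls back to "Breathing Idle" when nothing is queued.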

09 How the avatar finds its audio.

The avatar is a separate R3F canvas, but it doesn't run its own audio context — it borrows the one from BawtHub.tsx. When the voice call's useRealtimeAudioOutput creates the AudioContext + AudioWorkletNode + AnalyserNode, the analyser ref is exposed to the avatar via component props. The avatar's lip sync hook reads from that same analyser — meaning the mouth opens at the same moment the speakers do, with no extra audio routing overhead.

On the /avatar page (where there's no active voice call), the model still loads and animates — idle motion, breath, blink — but isSpeaking stays false and the analyser is null, so the lip sync hook just runs the procedural-only branch. The mode flips back the instant a voice call starts.

10 Key files.

frontend/src/app/avatar/AvatarModel.tsx

The dispatcher. One component, three loader shells (FBX/GLB/VRM), all funneling into the shared hooks.

frontend/src/app/avatar/useLipSync.ts

The 60 fps frame loop. AnalyserNode read, amplitude smoothing, mouth/blink/head drive, arm-space offset, speaking-motion perturbations on top of clips.

frontend/src/app/avatar/useModelSetup.ts

Discovery and overrides. Walks the loaded scene, collects meshes/morphs/materials/bones, applies persisted overrides, builds bone refs.

frontend/src/app/avatar/useAnimations.ts

FBX clip loader. Lazy-loads Mixamo clips, retargets to the current rig, manages the AnimationMixer + AnimationActions registry, fires onClipComplete when the clip finishes.

frontend/src/app/avatar/retarget.ts

The retargeter. Bone-name remap, rest-pose correction (sourceRest⁻¹ × q × targetRest), track dropping for unmapped end-bones.

frontend/src/app/avatar/CameraRig.tsx · SceneLighting.tsx · SceneExposer.tsx

Scene chrome. Camera transitions, three-point lighting, scene-graph exposure for the admin UIs.

frontend/src/app/avatar/persistence.ts

The override bridge. Load + save per-model settings against /api/avatar/settings.

frontend/src/app/avatar/utils.ts

Mesh/morph/bone discovery. findBone, findMorph, findMorphByKeyword, discoverModel — the heuristics that auto-map mouth and blink targets.

frontend/src/lib/boneMapping.ts

The canonical bone map. Mixamo-to-VRM lookup, STANDARD_BONE_GROUPS for the admin UI, autoMapBonesToMap for first-load auto-detection.

frontend/src/app/BoneMappingAdmin.tsx · AnimationTriggersAdmin.tsx

The escape hatches. Human-editable mapping when auto-detection fails; LLM-keyword → animation routing.

frontend/src/app/AvatarViewer.tsx

The R3F canvas wrapper. Used both at /avatar and as an inline embed in the voice page.

frontend/src/app/avatar/proceduralPose.ts · poseLibrary.ts

Procedural idle. Pose blends used when no clip is playing, plus a small library of static poses callable by name.


Validated against main on 2026-05-13 · Source: bawthub repo (private)