Building sign language conversations from individual signs

Youngmin Kim, Kyobin Choo, Jiwoo Park, Minseo Kim, Chanyoung Kim, Junhyeok Kim, Seong Jae Hwang

Sign language users often have to rely on spoken or written language as an intermediary when interacting with conversational AI, which limits accessibility. This work addresses the scarcity of sentence-level sign video data by constructing continuous sign conversations from large collections of isolated signs labeled with motion primitives. SignaVox-W provides the largest isolated-sign vocabulary to date; SignaVox-U is a 3D conversation dataset built from it. The system uses a retrieval-guided translator to bridge the structures of spoken and signed language, and BRAID, a diffusion Transformer, aligns durations and fills co-articulatory gaps between independently collected clips. The resulting SignaVox model generates full-body, hand, and facial responses directly from prior signing context, without relying on text or glosses at inference time.