The 3D Digital Human team at Tencent Hunyuan has announced the release of HY-Motion 1.0, a family of open-weight models for generating 3D human motion from text. The system pairs a Diffusion Transformer (DiT) architecture with Flow Matching and is scaled to one billion parameters.
HY-Motion 1.0 translates natural-language prompts and specified durations into detailed 3D human motion clips. The motions are produced on a standardized SMPL-H skeleton, which makes them readily adaptable to downstream animation pipelines. The model code, trained checkpoints, and a Gradio interface are publicly available on GitHub and Hugging Face, so researchers and developers can run the system locally.
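The released repository ships its own entry points; as a rough illustration of the prompt-plus-duration interface described above, the following Python sketch shows how such a call might be structured. The MotionClip container, the pipeline.sample call, and the per-frame pose layout are all assumptions made for illustration, not the actual HY-Motion API.

```python
# Hypothetical usage sketch -- the class, the pipeline.sample call, and the
# pose layout below are illustrative assumptions, not the released API.
# Consult the official GitHub / Hugging Face repositories for the real entry points.
from dataclasses import dataclass

import numpy as np


@dataclass
class MotionClip:
    """One generated clip: per-frame SMPL-H pose parameters plus root translation."""
    poses: np.ndarray             # (num_frames, 52, 3) axis-angle rotations (root + body + hands)
    root_translation: np.ndarray  # (num_frames, 3) global trajectory
    fps: int


def generate_motion(pipeline, prompt: str, duration_s: float, fps: int = 30) -> MotionClip:
    """Turn a text prompt and a target duration into a fixed-length motion clip."""
    num_frames = int(round(duration_s * fps))
    # A text-to-motion pipeline of this kind samples `num_frames` motion frames
    # conditioned on the encoded prompt, then decodes them to SMPL-H parameters.
    poses, trans = pipeline.sample(prompt=prompt, num_frames=num_frames)  # assumed method
    return MotionClip(poses=poses, root_translation=trans, fps=fps)
```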
Empowering 3D Developers and Animators
The model series is aimed directly at 3D animation work. HY-Motion 1.0 is a text-to-3D human motion generator built on a Diffusion Transformer (DiT) and trained with a Flow Matching objective. The release includes two versions: the standard HY-Motion-1.0 with 1.0 billion parameters and a lighter HY-Motion-1.0-Lite with 0.46 billion parameters.
Both models are capable of creating skeleton-based 3D character animations directly from straightforward text commands. The resulting motion sequences, rendered on an SMPL-H skeleton, are compatible with existing 3D animation and game development pipelines. This facilitates their integration into projects involving digital humans, cinematic sequences, and interactive virtual characters. To maximize accessibility, the release provides inference scripts, a command-line interface for batch processing, and a Gradio web application, with support across macOS, Windows, and Linux operating systems.
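Because the output is plain SMPL-H pose data, it can be posed and rendered with standard tooling. The sketch below uses the open-source smplx Python package, which requires separately downloaded SMPL-H model files; the 52-joint, axis-angle frame layout is an assumption about how a generated clip might be stored, not a documented HY-Motion detail.

```python
# Minimal sketch: pose an SMPL-H body model from a single frame of generated motion.
# Assumes the `smplx` package is installed and SMPL-H model files are available
# locally; the 52x3 axis-angle frame layout (1 root + 21 body + 30 hand joints)
# is an assumption about the clip format.
import numpy as np
import torch
import smplx

body_model = smplx.create(
    model_path="models",   # directory containing the SMPL-H model files
    model_type="smplh",
    gender="neutral",
    use_pca=False,         # drive the hands with full axis-angle rotations
)

frame_pose = np.zeros((52, 3), dtype=np.float32)   # one frame of the generated clip
pose = torch.from_numpy(frame_pose).reshape(1, -1)

output = body_model(
    global_orient=pose[:, 0:3],        # root orientation
    body_pose=pose[:, 3:66],           # 21 body joints
    left_hand_pose=pose[:, 66:111],    # 15 left-hand joints
    right_hand_pose=pose[:, 111:156],  # 15 right-hand joints
)
vertices = output.vertices.detach().numpy()[0]  # posed mesh for rendering or export
joints = output.joints.detach().numpy()[0]      # skeleton joints for retargeting
```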
The Technical Backbone: DiT and Flow Matching
At its core, HY-Motion 1.0 employs a hybrid HY-Motion DiT network. Dual-stream blocks first process motion latents and text tokens independently, each with dedicated QKV projections and MLPs, while a joint attention module lets motion tokens semantically query text features without collapsing the modality-specific structure. The network then transitions to single-stream blocks, where motion and text tokens are concatenated and processed together using parallel spatial and channel attention for deeper multimodal integration.
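The dual-stream/single-stream split can be made concrete with a short PyTorch sketch. Everything below (layer sizes, single-head attention, the omission of normalization, timestep conditioning, and the parallel spatial/channel attention) is a simplification chosen for illustration, not the released implementation.

```python
# Illustrative PyTorch sketch of the dual-stream -> single-stream idea.
# Normalization, timestep conditioning, multi-head attention, and the parallel
# spatial/channel attention of the real network are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DualStreamBlock(nn.Module):
    """Motion and text tokens keep separate QKV/MLP weights but attend jointly."""

    def __init__(self, dim: int):
        super().__init__()
        self.motion_qkv = nn.Linear(dim, 3 * dim)
        self.text_qkv = nn.Linear(dim, 3 * dim)
        self.motion_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.text_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, motion: torch.Tensor, text: torch.Tensor):
        n_motion = motion.size(1)
        # Modality-specific projections keep the two streams' weights separate.
        mq, mk, mv = self.motion_qkv(motion).chunk(3, dim=-1)
        tq, tk, tv = self.text_qkv(text).chunk(3, dim=-1)
        # Joint attention over the concatenated sequence lets motion tokens
        # query text features while each modality retains its own parameters.
        q, k, v = (torch.cat(p, dim=1) for p in ((mq, tq), (mk, tk), (mv, tv)))
        out = F.scaled_dot_product_attention(q, k, v)  # single-head for brevity
        m_out, t_out = out[:, :n_motion], out[:, n_motion:]
        return motion + self.motion_mlp(m_out), text + self.text_mlp(t_out)


class SingleStreamBlock(nn.Module):
    """Later blocks process the concatenated motion+text sequence with shared weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        q, k, v = self.qkv(tokens).chunk(3, dim=-1)
        return tokens + self.mlp(F.scaled_dot_product_attention(q, k, v))
```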
Text conditioning is managed through a dual-encoder setup: Qwen3 8B provides token-level embeddings and a CLIP-L model supplies global text features. A Bidirectional Token Refiner counteracts the causal attention bias of the language model, supporting non-autoregressive generation. Attention is asymmetric: motion tokens attend to all text tokens, but text tokens do not attend back to motion, which prevents noisy motion states from corrupting the linguistic representation.

Training uses Flow Matching rather than traditional denoising diffusion. The model learns a velocity field along a continuous path that interpolates between Gaussian noise and real motion data, which keeps training stable for extended sequences.
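Two of these ideas are simple to write down. Below is a generic sketch of a Flow Matching training step, which regresses the model's predicted velocity toward the constant velocity of a straight noise-to-data path, and of an asymmetric attention mask that lets motion attend to text but not the reverse. The model call signature, tensor shapes, and mask convention are assumptions for illustration, not the released training code.

```python
# Generic sketches of the two mechanisms above; shapes, the model signature,
# and the boolean mask convention (True = attention allowed) are assumptions.
import torch
import torch.nn.functional as F


def flow_matching_loss(model, motion: torch.Tensor, text_cond: torch.Tensor) -> torch.Tensor:
    """One flow-matching training step. motion: (batch, frames, feat) clean latents."""
    noise = torch.randn_like(motion)
    t = torch.rand(motion.size(0), 1, 1, device=motion.device)  # one timestep per sample

    # Linear interpolation between noise (t=0) and data (t=1)...
    x_t = (1.0 - t) * noise + t * motion
    # ...whose time derivative is a constant velocity, used as the regression target.
    target_velocity = motion - noise

    predicted_velocity = model(x_t, t.view(-1), text_cond)  # assumed call signature
    return F.mse_loss(predicted_velocity, target_velocity)


def asymmetric_mask(n_motion: int, n_text: int, device=None) -> torch.Tensor:
    """Motion rows may attend everywhere; text rows may attend only to text."""
    n = n_motion + n_text
    mask = torch.ones(n, n, dtype=torch.bool, device=device)
    mask[n_motion:, :n_motion] = False  # text tokens never attend back to motion
    return mask
```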
Building the Foundation: Data and Training
The creation of HY-Motion 1.0 involved an extensive and meticulously curated dataset derived from three primary sources: real-world human motion videos, professional motion capture data, and 3D animation assets from game production. A substantial 12 million high-quality video clips were processed, involving scene segmentation, human detection, and the application of the GVHMR algorithm to reconstruct SMPL-X motion tracks. An additional 500 hours of motion sequences came from motion capture sessions and 3D animation libraries.
All data underwent rigorous processing, including retargeting onto a unified SMPL-H skeleton and a multi-stage filtering process to eliminate duplicates, abnormal poses, and various artifacts. The final dataset encompasses over 3,000 hours of motion, with a premium subset of 400 hours comprising high-quality 3D motion accompanied by verified captions. A comprehensive three-level taxonomy further organizes this data into over 200 fine-grained motion categories, spanning broad classes like Locomotion, Sports, and Daily Activities.
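As a rough picture of what a three-level taxonomy of this kind looks like, the snippet below sketches a few entries. The top-level classes come from the report; the finer-grained categories and prompts are invented placeholders.

```python
# Illustrative three-level taxonomy entry: broad class -> category -> example prompts.
# The top-level classes are named in the report; everything below them is invented.
taxonomy = {
    "Locomotion": {
        "Walking": ["walk forward slowly", "walk in a circle"],
    },
    "Sports": {
        "Ball sports": ["shoot a basketball", "kick a soccer ball"],
    },
    "Daily Activities": {
        "Household": ["sweep the floor", "open a door"],
    },
}
```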
Performance and Scalability Benchmarks
Evaluations conducted on a test set of over 2,000 prompts demonstrate HY-Motion 1.0's superior performance. Human raters assessed instruction following and motion quality on a 1-5 scale. HY-Motion 1.0 achieved an average instruction following score of 3.24 and an SSAE score of 78.6 percent, significantly outperforming baseline text-to-motion systems like DART and MoMask, which scored between 2.17 and 2.31. For motion quality, HY-Motion 1.0 averaged 3.43, compared to the best baseline's 3.11.
Scaling experiments confirmed that instruction following capabilities improved consistently with larger model sizes, with the 1-billion parameter model reaching an average score of 3.34. Motion quality showed saturation around the 0.46-billion parameter scale. These studies also highlighted that greater data volume is crucial for aligning instructions, while high-quality data curation primarily enhances realism.
Conclusion
HY-Motion 1.0 is a notable step forward in AI-driven 3D human motion generation. By integrating a billion-parameter Diffusion Transformer with Flow Matching and a carefully curated dataset, Tencent has delivered an open-weight system with strong fidelity and instruction following. The technology stands to help creators across animation, gaming, and digital human development, streamlining workflows and expanding the creative possibilities for character motion.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost