Tencent Hunyuan has released its High-Performance Computing Operations (HPC-Ops) library, an open-source project designed to accelerate large language model (LLM) inference. The production-ready library optimizes the core computational operators that dominate inference workloads, giving AI developers a drop-in path to faster serving.
Driving Performance in Production
HPC-Ops focuses on low-level CUDA kernels for core operations like Attention, Grouped GEMM, and Fused Mixture of Experts (MoE). These are exposed via compact C and Python APIs for seamless integration into existing LLM inference stacks.
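To make the workload concrete, here is a minimal pure-NumPy reference for single-query attention over a cached key/value history, the operation the decode-attention path computes. This is only an illustration of the math such a kernel accelerates, not HPC-Ops code or its API; all names are ours.

```python
import numpy as np

def decode_attention_reference(q, k_cache, v_cache):
    """Single-query attention over cached keys/values (one decode step).

    q:        (num_heads, head_dim)          query for the new token
    k_cache:  (seq_len, num_heads, head_dim) cached keys
    v_cache:  (seq_len, num_heads, head_dim) cached values
    Returns:  (num_heads, head_dim)          attention output
    """
    head_dim = q.shape[-1]
    # Scores: for each head, dot the query against every cached key.
    scores = np.einsum("hd,shd->hs", q, k_cache) / np.sqrt(head_dim)
    # Numerically stable softmax over the sequence dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Weighted sum of cached values.
    return np.einsum("hs,shd->hd", probs, v_cache)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((8, 64), dtype=np.float32)
    k = rng.standard_normal((128, 8, 64), dtype=np.float32)
    v = rng.standard_normal((128, 8, 64), dtype=np.float32)
    print(decode_attention_reference(q, k, v).shape)  # (8, 64)
```

An optimized CUDA kernel performs the same computation while fusing the score, softmax, and weighted-sum stages and minimizing trips to global memory.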
Already effective in Tencent's internal services, HPC-Ops demonstrates notable improvements in queries per minute (QPM): approximately 30% for Tencent-HY models and 17% for DeepSeek models on mainstream inference cards. These service-level gains reflect the positive impact of faster kernels within active pipelines.
Modular Design for Seamless Integration
Developed by the Tencent Hunyuan AI Infra team, HPC-Ops is a high-performance, user-friendly operator library for LLM inference. Rather than replacing entire serving frameworks, it supplies optimized kernels and clear APIs to systems that continue to handle scheduling, KV cache management, batching, and data transport.
This modular approach makes adoption straightforward within popular frameworks like vLLM and SGLang: teams can swap their underlying kernel implementations for HPC-Ops while keeping external service behavior unchanged. Built on C++ and CUDA, the library leverages CuTe and CUTLASS, and its kernels are written clearly enough to double as worked examples of modern CUDA programming.
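The integration pattern is essentially a kernel-registry swap: the serving stack keeps its scheduling and batching logic and only changes which implementation a named operation dispatches to. Below is a minimal, framework-agnostic sketch of that pattern; it is not vLLM's or SGLang's actual plug-in mechanism and not the HPC-Ops API, and the function names are hypothetical stand-ins.

```python
from typing import Any, Callable, Dict

# A tiny kernel registry: serving code calls operations by name, so the
# backing implementation can be swapped without touching the callers or
# changing external service behavior (inputs, outputs, batching).
_KERNELS: Dict[str, Callable[..., Any]] = {}

def register_kernel(name: str, fn: Callable[..., Any]) -> None:
    """Install (or replace) the implementation behind an operation name."""
    _KERNELS[name] = fn

def run_kernel(name: str, *args: Any, **kwargs: Any) -> Any:
    """Dispatch an operation to whichever implementation is registered."""
    return _KERNELS[name](*args, **kwargs)

# Stand-ins for a framework's default kernel and an optimized drop-in
# with the same signature and semantics (both hypothetical).
def attention_decode_baseline(q, k_cache, v_cache):
    return "baseline result"

def attention_decode_optimized(q, k_cache, v_cache):
    return "optimized result"

register_kernel("attention_decode", attention_decode_baseline)
print(run_kernel("attention_decode", None, None, None))  # baseline result

# Swapping the backend is a one-line change for the serving stack;
# callers and service-level behavior are untouched.
register_kernel("attention_decode", attention_decode_optimized)
print(run_kernel("attention_decode", None, None, None))  # optimized result
```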
Microbenchmark Results: Kernel-Level Speedups
Detailed microbenchmark results show HPC-Ops' speedups against established baseline kernels:
- For Attention (BF16), it achieves up to 1.33x faster prefill and 2.22x faster decode compared to leading alternatives.
- In FP8 Attention, gains reach 1.12x prefill and 2.0x decode.
- Fused MoE (FP8) shows up to 1.49x prefill and 1.14x decode.
- Grouped GEMM (FP8) achieves up to 1.1x prefill and 1.88x decode.
These substantial improvements, especially in decode operations, are critical. Decode is often the primary latency bottleneck in autoregressive generation, where memory traffic dominates. HPC-Ops' focus here directly addresses a key user experience concern.
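A back-of-the-envelope calculation shows why decode tends to be memory-bound: each generated token must stream the full set of weights (plus the KV cache) through memory while performing only a modest amount of compute per byte. The numbers below are illustrative, not measurements of any particular model or GPU.

```python
# Rough arithmetic intensity of one decode step for a dense 7B-parameter
# model in FP16 (illustrative numbers only).
params = 7e9                       # model parameters
bytes_per_param = 2                # FP16 weight storage
weight_bytes = params * bytes_per_param

# A matrix-vector product does ~2 FLOPs per weight (multiply + add).
flops_per_token = 2 * params

# Arithmetic intensity: FLOPs performed per byte of weights read.
intensity = flops_per_token / weight_bytes
print(f"arithmetic intensity: {intensity:.1f} FLOP/byte")  # ~1 FLOP/byte

# Modern accelerators sustain hundreds of FLOPs per byte of memory
# bandwidth, so at roughly 1 FLOP/byte the decode step is limited by how
# fast weights and KV cache can be streamed, not by compute. That is why
# kernel-level reductions in memory traffic show up directly as decode
# speedups at the service level.
```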
Core Operator Families and Precision Support
HPC-Ops' initial release categorizes functionality into three primary operator families:
- Attention Kernels: Support prefill and decode, including paged attention for optimizing memory reuse in long sequences.
- Grouped GEMM: FP8-quantized weights with both block-wise and per-tensor scaling, trading scaling granularity against overhead (see the sketch after this list).
- Fused-MoE: Integrates routing and expert computation in a single quantized operation, using FP8 expert weights and supporting block-wise/per-tensor scaling.
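To make the granularity trade-off concrete, the sketch below contrasts per-tensor and block-wise scaling on a toy weight matrix. NumPy has no FP8 type, so quantization is simulated with 8-bit integer levels; the point is only that finer-grained scales track local value ranges better at the cost of storing more scale factors, which mirrors the block-wise versus per-tensor choice described above. This is our illustration, not HPC-Ops code.

```python
import numpy as np

def quantize(w, scale):
    """Simulated 8-bit quantize/dequantize with a broadcastable scale."""
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

rng = np.random.default_rng(0)
# Toy weight matrix whose rows have very different magnitudes,
# the situation where a single per-tensor scale loses the most accuracy.
w = rng.standard_normal((4, 128)) * np.array([[0.01], [0.1], [1.0], [10.0]])

# Per-tensor: one scale for the whole matrix (cheapest, coarsest).
scale_tensor = np.abs(w).max() / 127
w_tensor = quantize(w, scale_tensor)

# Block-wise: one scale per 64-column block of each row (more scales, finer).
block = 64
w_blocked = w.reshape(4, -1, block)
scale_block = np.abs(w_blocked).max(axis=-1, keepdims=True) / 127
w_block = quantize(w_blocked, scale_block).reshape(4, 128)

mean_err = lambda a: np.abs(a - w).mean()
print(f"per-tensor mean error: {mean_err(w_tensor):.5f}")
print(f"block-wise mean error: {mean_err(w_block):.5f}")
```

Block-wise scaling reproduces the small-magnitude rows far more faithfully, at the cost of carrying one scale per block instead of one per tensor.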
Across these, HPC-Ops natively supports BF16 and FP8 data types. This aligns with production trends toward lower precision, maintaining accuracy while reducing memory bandwidth and enhancing tensor core utilization.
Future Enhancements
Tencent Hunyuan has outlined an ambitious roadmap for HPC-Ops. Future developments include sparse attention for long-context LLMs, expanded 4-bit and 8-bit quantization, and advanced kernels designed to overlap computation with multi-GPU communication, promising further gains in distributed inference.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost