Researchers at Zlab Princeton have unveiled the LLM-Pruning Collection, a sophisticated, JAX-based repository that consolidates prominent pruning algorithms for large language models (LLMs) into a unified and reproducible framework. The primary objective of this release is to facilitate straightforward comparisons of block-level, layer-level, and weight-level pruning strategies within a consistent training and evaluation environment, adaptable for both GPUs and TPUs.
Inside the LLM-Pruning Collection
Described as a dedicated JAX repository for LLM compression, the collection is logically organized into three core directories:
- pruning: Implementations of established pruning methods, including Minitron, ShortGPT, Wanda, SparseGPT, Magnitude, Sheared Llama, and LLM-Pruner.
- training: Integration with FMS-FSDP for GPU-accelerated training and with MaxText for TPU environments.
- eval: JAX-compatible evaluation scripts built on the widely used lm-eval-harness, including accelerate-based support for MaxText that can yield a 2 to 4 times speedup in evaluation.
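The article does not reproduce the repository's evaluation entry points, but since the scripts build on lm-eval-harness, a plain invocation of that harness's Python API looks roughly like the sketch below; the checkpoint name, task list, and batch size are illustrative placeholders rather than anything taken from the repo.

```python
# Illustrative use of EleutherAI's lm-eval-harness Python API (not the
# repository's own wrapper scripts). Model name, tasks, and batch size are
# placeholders chosen for the example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["arc_easy", "hellaswag", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, etc.) are returned under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```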
Featured Pruning Techniques
The LLM-Pruning Collection encompasses a diverse array of pruning algorithms, each targeting different levels of granularity:
- Minitron: Developed by NVIDIA, Minitron is a practical pruning and distillation method that compresses models such as Llama 3.1 8B and Mistral NeMo 12B to smaller sizes (4B and 8B respectively) while maintaining performance. It explores depth pruning and joint width pruning of various components, followed by distillation. Scripts within the pruning/minitron folder facilitate Minitron-style pruning on models like Llama 3.1 8B.
- ShortGPT: This method operates on the premise that many Transformer layers are redundant. ShortGPT introduces "Block Influence," a metric that gauges each layer's contribution, and removes the least influential layers by direct deletion (a minimal sketch of the metric follows this list). Experimental results indicate that ShortGPT outperforms earlier pruning techniques on both multiple-choice and generative tasks. Its implementation is found within the Minitron folder, featuring a dedicated script for Llama 2 7B.
- Wanda, SparseGPT, and Magnitude: These three post-training pruning methods are grouped together. Wanda scores each weight by multiplying its magnitude with the norm of the corresponding input activation, comparing scores output by output, and prunes the lowest-scoring weights without retraining (see the sketch after this list). SparseGPT employs a second-order-inspired reconstruction step to prune large GPT-style models at high sparsity ratios. Magnitude pruning serves as a foundational baseline, removing the weights with the smallest absolute values. All three reside under pruning/wanda, and the repository's documentation includes detailed comparison tables for Llama 2 7B across various benchmarks, covering both unstructured and structured sparsity patterns.
- Sheared Llama: A structured pruning approach, Sheared LLaMA learns masks over layers, attention heads, and hidden dimensions, then retrains the resulting smaller architecture (a simplified masking sketch follows this list). The pruning/llmshearing directory incorporates this recipe, using a RedPajama subset sourced through Hugging Face for calibration, alongside helper scripts for converting between Hugging Face and MosaicML Composer formats.
- LLM-Pruner: This framework specializes in structural pruning of large language models. It identifies and removes non-critical, coupled structures (e.g., attention heads or MLP channels) using gradient-based importance scores, then recovers performance through a brief LoRA tuning phase, often using around 50,000 samples (see the importance-score sketch after this list). LLM-Pruner is included under pruning/LLM-Pruner, offering scripts compatible with LLaMA, LLaMA 2, and Llama 3.1 8B.
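To make "Block Influence" concrete: ShortGPT bases a layer's influence on the cosine similarity between the hidden states entering and leaving it, so layers that barely transform the residual stream score low and become deletion candidates. The JAX sketch below follows that idea on a calibration batch; the function names and the layer-ranking helper are illustrative, not the repository's code.

```python
# Minimal sketch of a ShortGPT-style "Block Influence" score in JAX.
# Assumption: influence of a layer = 1 - mean cosine similarity between the
# hidden states entering and leaving that layer on a calibration batch.
import jax.numpy as jnp


def block_influence(hidden_in: jnp.ndarray, hidden_out: jnp.ndarray) -> jnp.ndarray:
    """hidden_in / hidden_out: (batch, seq_len, d_model) activations for one layer."""
    num = jnp.sum(hidden_in * hidden_out, axis=-1)
    denom = (jnp.linalg.norm(hidden_in, axis=-1) *
             jnp.linalg.norm(hidden_out, axis=-1) + 1e-8)
    cos_sim = num / denom                      # (batch, seq_len)
    return 1.0 - jnp.mean(cos_sim)             # scalar influence for this layer


def layers_to_drop(per_layer_inputs, per_layer_outputs, n_drop: int):
    """Rank layers by influence and return the indices of the least influential."""
    scores = jnp.stack([block_influence(i, o)
                        for i, o in zip(per_layer_inputs, per_layer_outputs)])
    return jnp.argsort(scores)[:n_drop]        # lowest-influence layers first
```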
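The Wanda and magnitude criteria, in turn, reduce to simple per-weight scores, which the minimal sketch below illustrates for a single linear layer: magnitude pruning ranks weights by absolute value alone, while Wanda scales each |W_ij| by the L2 norm of input feature j collected from a calibration batch and compares scores within each output row. The function names, shapes, and the 50% sparsity in the closing comment are assumptions for illustration, not the repository's interfaces.

```python
# Minimal sketch of magnitude vs. Wanda-style scoring for one linear layer.
# W: (out_features, in_features) weight matrix; X: (n_tokens, in_features)
# calibration activations. Names and shapes are illustrative placeholders.
import jax.numpy as jnp


def magnitude_mask(W: jnp.ndarray, sparsity: float) -> jnp.ndarray:
    """Keep the largest-|w| weights within each output row."""
    scores = jnp.abs(W)
    thresh = jnp.quantile(scores, sparsity, axis=1, keepdims=True)  # per-row cutoff
    return scores > thresh                               # True = keep the weight


def wanda_mask(W: jnp.ndarray, X: jnp.ndarray, sparsity: float) -> jnp.ndarray:
    """Score each weight as |W_ij| * ||X_j||_2, compared within its output row."""
    act_norm = jnp.linalg.norm(X, axis=0)                # (in_features,) feature norms
    scores = jnp.abs(W) * act_norm[None, :]
    thresh = jnp.quantile(scores, sparsity, axis=1, keepdims=True)
    return scores > thresh


# Pruning is then a plain elementwise product, e.g. W_pruned = W * wanda_mask(W, X, 0.5)
```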
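Sheared LLaMA's mask learning is more involved (hard-concrete masks trained under a Lagrangian sparsity constraint), but its core mechanism, multiplying structures by learnable masks that are pushed toward a target sparsity during training, can be caricatured as follows. This is a deliberately simplified sketch; the mask variable, penalty form, and coefficient are assumptions, not the published recipe.

```python
# Deliberately simplified sketch of mask-based structured pruning in the spirit
# of Sheared LLaMA. Real recipes use hard-concrete masks and a Lagrangian
# constraint; a relaxed mask and a quadratic penalty stand in for them here.
import jax.numpy as jnp


def masked_linear(x, W, b, z_hidden):
    """Linear layer whose output dimensions are scaled by a learnable mask z_hidden."""
    return (x @ W.T + b) * z_hidden            # masked dimensions are (softly) zeroed


def sparsity_penalty(z_hidden, target_sparsity, lam=1.0):
    """Encourage the expected kept fraction to match 1 - target_sparsity."""
    kept_fraction = jnp.mean(z_hidden)
    return lam * (kept_fraction - (1.0 - target_sparsity)) ** 2
```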
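Finally, the gradient-based importance behind LLM-Pruner-style structural pruning is often approximated by a first-order Taylor term, roughly the sum of |weight × gradient| over each coupled group such as an attention head. The sketch below illustrates that approximation; `loss_fn`, the `"attn_out_proj"` parameter name, and the head layout are hypothetical placeholders rather than the framework's real interfaces.

```python
# Minimal sketch of first-order (Taylor) importance for grouped structures,
# in the spirit of LLM-Pruner. `loss_fn`, `params`, and the head grouping are
# hypothetical placeholders, not the framework's real API.
import jax
import jax.numpy as jnp


def head_importance(loss_fn, params, batch, head_dim: int) -> jnp.ndarray:
    """Approximate each attention head's importance as the sum of |w * dL/dw|."""
    grads = jax.grad(loss_fn)(params, batch)                  # same pytree as params
    w = params["attn_out_proj"]                               # (d_model, n_heads * head_dim)
    g = grads["attn_out_proj"]
    contrib = jnp.abs(w * g)                                  # elementwise first-order term
    n_heads = w.shape[1] // head_dim
    per_head = contrib.reshape(w.shape[0], n_heads, head_dim)
    return per_head.sum(axis=(0, 2))                          # one score per head


# Heads with the lowest scores are removal candidates; a short LoRA tuning
# pass is then used to recover accuracy, as described above.
```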
Key Contributions and Advantages
The LLM-Pruning Collection represents a significant stride in LLM optimization. It unifies contemporary pruning strategies with consistent pipelines for training and evaluation on diverse hardware. The codebase not only implements a range of granular pruning approaches but also includes model-specific scripts for popular Llama family models. A notable feature is its capacity to reproduce core experimental results from previous pruning research, offering engineers "paper vs. reproduced" tables to validate their own runs against established benchmarks. This comprehensive resource aims to accelerate the development and deployment of more efficient large language models.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost