Researchers at Zlab Princeton have unveiled the LLM-Pruning Collection, a sophisticated, JAX-based repository that consolidates prominent pruning algorithms for large language models (LLMs) into a unified and reproducible framework. The primary objective of this release is to facilitate straightforward comparisons of block-level, layer-level, and weight-level pruning strategies within a consistent training and evaluation environment, adaptable for both GPUs and TPUs.
Inside the LLM-Pruning Collection
Described as a dedicated JAX repository for LLM compression, the collection is logically organized into three core directories:
- pruning: Implementations of established pruning methods, including Minitron, ShortGPT, Wanda, SparseGPT, Magnitude, Sheared Llama, and LLM-Pruner.
- training: Integration with FMS-FSDP for GPU-accelerated training and with MaxText for TPU environments.
- eval: JAX-compatible evaluation scripts built on the widely used lm-eval-harness, including accelerate-based support for MaxText that can yield a 2 to 4 times speedup in evaluation.
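The article does not reproduce the repository's evaluation entry points, but since the scripts build on lm-eval-harness, a plain invocation of that harness's Python API looks roughly like the sketch below; the checkpoint name, task list, and batch size are illustrative placeholders rather than anything taken from the repo.

```python
# Illustrative use of EleutherAI's lm-eval-harness Python API (not the
# repository's own wrapper scripts). Model name, tasks, and batch size are
# placeholders chosen for the example.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face transformers backend
    model_args="pretrained=meta-llama/Llama-2-7b-hf",
    tasks=["arc_easy", "hellaswag", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)

# Per-task metrics (accuracy, etc.) are returned under results["results"].
for task, metrics in results["results"].items():
    print(task, metrics)
```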
Featured Pruning Techniques
The LLM-Pruning Collection encompasses a diverse array of pruning algorithms, each targeting different levels of granularity:
- Minitron: Developed by NVIDIA, Minitron is a practical pruning and distillation method that compresses models such as Llama 3.1 8B and Mistral NeMo 12B to smaller sizes (4B and 8B respectively) while maintaining performance. It explores depth pruning and joint width pruning of various components, followed by distillation. Scripts within the pruning/minitron folder facilitate Minitron-style pruning on models like Llama 3.1 8B.
- ShortGPT: This method operates on the premise that many Transformer layers are redundant. ShortGPT introduces "Block Influence," a metric that gauges each layer's contribution, and removes the least influential layers by direct deletion (a minimal sketch of the metric follows this list). Experimental results indicate that ShortGPT outperforms earlier pruning techniques on both multiple-choice and generative tasks. Its implementation is found within the Minitron folder, featuring a dedicated script for Llama 2 7B.
- Wanda, SparseGPT, and Magnitude: These three post-training pruning methods are grouped together. Wanda scores each weight by multiplying its magnitude with the norm of the corresponding input activation, comparing scores output by output, and prunes the lowest-scoring weights without retraining (see the sketch after this list). SparseGPT employs a second-order-inspired reconstruction step to prune large GPT-style models at high sparsity ratios. Magnitude pruning serves as a foundational baseline, removing the weights with the smallest absolute values. All three reside under pruning/wanda, and the repository's documentation includes detailed comparison tables for Llama 2 7B across various benchmarks, covering both unstructured and structured sparsity patterns.
- Sheared Llama: A structured pruning approach, Sheared LLaMA learns masks over layers, attention heads, and hidden dimensions, then retrains the resulting smaller architecture (a simplified masking sketch follows this list). The pruning/llmshearing directory incorporates this recipe, using a RedPajama subset sourced through Hugging Face for calibration, alongside helper scripts for converting between Hugging Face and MosaicML Composer formats.
- LLM-Pruner: This framework specializes in structural pruning of large language models. It identifies and removes non-critical, coupled structures (e.g., attention heads or MLP channels) using gradient-based importance scores, then recovers performance through a brief LoRA tuning phase, often using around 50,000 samples (see the importance-score sketch after this list). LLM-Pruner is included under pruning/LLM-Pruner, offering scripts compatible with LLaMA, LLaMA 2, and Llama 3.1 8B.
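To make "Block Influence" concrete: ShortGPT bases a layer's influence on the cosine similarity between the hidden states entering and leaving it, so layers that barely transform the residual stream score low and become deletion candidates. The JAX sketch below follows that idea on a calibration batch; the function names and the layer-ranking helper are illustrative, not the repository's code.

```python
# Minimal sketch of a ShortGPT-style "Block Influence" score in JAX.
# Assumption: influence of a layer = 1 - mean cosine similarity between the
# hidden states entering and leaving that layer on a calibration batch.
import jax.numpy as jnp


def block_influence(hidden_in: jnp.ndarray, hidden_out: jnp.ndarray) -> jnp.ndarray:
    """hidden_in / hidden_out: (batch, seq_len, d_model) activations for one layer."""
    num = jnp.sum(hidden_in * hidden_out, axis=-1)
    denom = (jnp.linalg.norm(hidden_in, axis=-1) *
             jnp.linalg.norm(hidden_out, axis=-1) + 1e-8)
    cos_sim = num / denom                      # (batch, seq_len)
    return 1.0 - jnp.mean(cos_sim)             # scalar influence for this layer


def layers_to_drop(per_layer_inputs, per_layer_outputs, n_drop: int):
    """Rank layers by influence and return the indices of the least influential."""
    scores = jnp.stack([block_influence(i, o)
                        for i, o in zip(per_layer_inputs, per_layer_outputs)])
    return jnp.argsort(scores)[:n_drop]        # lowest-influence layers first
```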
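The Wanda and magnitude criteria, in turn, reduce to simple per-weight scores, which the minimal sketch below illustrates for a single linear layer: magnitude pruning ranks weights by absolute value alone, while Wanda scales each |W_ij| by the L2 norm of input feature j collected from a calibration batch and compares scores within each output row. The function names, shapes, and the 50% sparsity in the closing comment are assumptions for illustration, not the repository's interfaces.

```python
# Minimal sketch of magnitude vs. Wanda-style scoring for one linear layer.
# W: (out_features, in_features) weight matrix; X: (n_tokens, in_features)
# calibration activations. Names and shapes are illustrative placeholders.
import jax.numpy as jnp


def magnitude_mask(W: jnp.ndarray, sparsity: float) -> jnp.ndarray:
    """Keep the largest-|w| weights within each output row."""
    scores = jnp.abs(W)
    thresh = jnp.quantile(scores, sparsity, axis=1, keepdims=True)  # per-row cutoff
    return scores > thresh                               # True = keep the weight


def wanda_mask(W: jnp.ndarray, X: jnp.ndarray, sparsity: float) -> jnp.ndarray:
    """Score each weight as |W_ij| * ||X_j||_2, compared within its output row."""
    act_norm = jnp.linalg.norm(X, axis=0)                # (in_features,) feature norms
    scores = jnp.abs(W) * act_norm[None, :]
    thresh = jnp.quantile(scores, sparsity, axis=1, keepdims=True)
    return scores > thresh


# Pruning is then a plain elementwise product, e.g. W_pruned = W * wanda_mask(W, X, 0.5)
```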
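Sheared LLaMA's mask learning is more involved (hard-concrete masks trained under a Lagrangian sparsity constraint), but its core mechanism, multiplying structures by learnable masks that are pushed toward a target sparsity during training, can be caricatured as follows. This is a deliberately simplified sketch; the mask variable, penalty form, and coefficient are assumptions, not the published recipe.

```python
# Deliberately simplified sketch of mask-based structured pruning in the spirit
# of Sheared LLaMA. Real recipes use hard-concrete masks and a Lagrangian
# constraint; a relaxed mask and a quadratic penalty stand in for them here.
import jax.numpy as jnp


def masked_linear(x, W, b, z_hidden):
    """Linear layer whose output dimensions are scaled by a learnable mask z_hidden."""
    return (x @ W.T + b) * z_hidden            # masked dimensions are (softly) zeroed


def sparsity_penalty(z_hidden, target_sparsity, lam=1.0):
    """Encourage the expected kept fraction to match 1 - target_sparsity."""
    kept_fraction = jnp.mean(z_hidden)
    return lam * (kept_fraction - (1.0 - target_sparsity)) ** 2
```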
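Finally, the gradient-based importance behind LLM-Pruner-style structural pruning is often approximated by a first-order Taylor term, roughly the sum of |weight × gradient| over each coupled group such as an attention head. The sketch below illustrates that approximation; `loss_fn`, the `"attn_out_proj"` parameter name, and the head layout are hypothetical placeholders rather than the framework's real interfaces.

```python
# Minimal sketch of first-order (Taylor) importance for grouped structures,
# in the spirit of LLM-Pruner. `loss_fn`, `params`, and the head grouping are
# hypothetical placeholders, not the framework's real API.
import jax
import jax.numpy as jnp


def head_importance(loss_fn, params, batch, head_dim: int) -> jnp.ndarray:
    """Approximate each attention head's importance as the sum of |w * dL/dw|."""
    grads = jax.grad(loss_fn)(params, batch)                  # same pytree as params
    w = params["attn_out_proj"]                               # (d_model, n_heads * head_dim)
    g = grads["attn_out_proj"]
    contrib = jnp.abs(w * g)                                  # elementwise first-order term
    n_heads = w.shape[1] // head_dim
    per_head = contrib.reshape(w.shape[0], n_heads, head_dim)
    return per_head.sum(axis=(0, 2))                          # one score per head


# Heads with the lowest scores are removal candidates; a short LoRA tuning
# pass is then used to recover accuracy, as described above.
```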
Key Contributions and Advantages
The LLM-Pruning Collection represents a significant stride in LLM optimization. It unifies contemporary pruning strategies with consistent pipelines for training and evaluation on diverse hardware. The codebase not only implements a range of granular pruning approaches but also includes model-specific scripts for popular Llama family models. A notable feature is its capacity to reproduce core experimental results from previous pruning research, offering engineers "paper vs. reproduced" tables to validate their own runs against established benchmarks. This comprehensive resource aims to accelerate the development and deployment of more efficient large language models.
This article is a rewritten summary based on publicly available reporting. For the original story, visit the source.
Source: MarkTechPost