vishvaRam/AutoRound-Quantaization

AutoRound Quantization Toolkit

A compact collection of notebooks and scripts for weight-only quantization using iterative rounding techniques. This repository contains production-focused pipelines to quantize transformer models, export them in common formats, and run inference with minimal memory overhead.

What it does:

  • Purpose: Reduce model size and VRAM usage while preserving accuracy.
  • Approach: Iterative weight tuning with group-wise quantization and 16-bit activations.
  • Outputs: Exportable quantized checkpoints compatible with common inference tools.

Contents:

  • Notebooks: Ready-to-run Jupyter notebooks for quantization experiments.
  • Docs: Implementation notes and examples in the ReadMe/ folder.

Quickstart

  • Requirements: Python 3.8+, PyTorch (CUDA recommended for quantization), Jupyter.
  • Install (example):
pip install -r requirements.txt
# or
pip install transformers torch auto-round huggingface-hub
  • Run a notebook: Launch Jupyter and open the provided .ipynb files to reproduce quantization and export flows.

Usage (inference, generic)

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<quantized-model-id>")
model = AutoModelForCausalLM.from_pretrained("<quantized-model-id>", device_map="auto")
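inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=50)
# Decode the generated token IDs; printing the raw tensor would show only IDs
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))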

Export formats: AutoRound-native, AWQ, GPTQ (format availability depends on the notebook and export scripts).

Where to look next:

  • Implementation notes and model-specific instructions: ReadMe/
  • Quantization notebooks (in root): files ending with .ipynb

License: Apache-2.0

For detailed tuning parameters, calibration notes, and export steps, see the markdown files in ReadMe/ and the notebooks in the repository.


Last updated: February 2026

AutoRound Quantization

Comprehensive model quantization project using Intel's AutoRound algorithm to create production-ready 4-bit quantized versions of language models. This project demonstrates advanced weight-only quantization techniques optimized for both text and vision-language models.

🎯 Project Overview

This repository contains automated quantization pipelines for various language models, quantized to W4A16 (4-bit weights, 16-bit activations) using Intel's AutoRound algorithm with extensive calibration and tuning for optimal accuracy retention.

📊 Quantization Details

The project produces quantized models in multiple formats optimized for different deployment scenarios.

🚀 Key Features

Advanced Quantization Configuration

TUNING_CONFIG = {
    "group_size": 128,              # Fine-grained quantization control
    "sym": True,                    # Symmetric quantization for better performance
    "iters": 600,                   # High-precision weight tuning
    "nsamples": 512,                # Extensive calibration samples
    "batch_size": 8,                # Optimized for high-end GPUs
    "seqlen": 2048,                 # Long context support
    "low_gpu_mem_usage": False,     # Keep on GPU for speed
    "enable_torch_compile": True,   # JIT compilation acceleration
    "quant_nontext_module": False   # Preserve vision tower accuracy (VLM only)
}
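A minimal sketch of how a configuration like this might be handed to auto-round. It assumes the `AutoRound` class and its `quantize_and_save` method from the `auto-round` package, whose accepted arguments can differ between releases. The helper names `autoround_kwargs` and `quantize_text_model`, the explicit `bits=4`, and the output directory are illustrative and are not taken from the notebooks.

```python
TUNING_CONFIG = {
    "group_size": 128,
    "sym": True,
    "iters": 600,
    "nsamples": 512,
    "batch_size": 8,
    "seqlen": 2048,
    "low_gpu_mem_usage": False,
    "enable_torch_compile": True,
    "quant_nontext_module": False,  # VLM only
}


def autoround_kwargs(config, is_vlm=False):
    """Copy the config, drop the VLM-only key for text models, and set 4-bit weights."""
    kwargs = dict(config)
    if not is_vlm:
        kwargs.pop("quant_nontext_module", None)
    kwargs.setdefault("bits", 4)  # assumption: W4A16 expressed as bits=4
    return kwargs


def quantize_text_model(model_id, output_dir, config=TUNING_CONFIG):
    """Hypothetical wrapper: tune a text model and save it in AutoRound format."""
    # Imported lazily so autoround_kwargs() works without the GPU stack installed.
    from auto_round import AutoRound
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    autoround = AutoRound(model, tokenizer, **autoround_kwargs(config))
    autoround.quantize_and_save(output_dir, format="auto_round")
```

Keeping the settings in one dictionary lets the text and vision-language notebooks share them, with only the VLM-specific key filtered out.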

Quantization Scheme: W4A16

  • 4-bit weights: Smaller checkpoints (~60% size reduction)
  • 16-bit activations: Activations are not quantized, so any accuracy loss comes only from weight rounding
  • Symmetric quantization: Better hardware support and inference speed
  • Group size 128: One scale per 128 weights, a common trade-off between accuracy and storage overhead
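To make the scheme concrete, here is a small pure-Python sketch of symmetric, group-wise 4-bit quantization using plain round-to-nearest. It is not AutoRound itself, which additionally tunes the rounding. The function names and the choice of the range [-7, 7] are assumptions made for illustration.

```python
import random


def quantize_groupwise(weights, group_size=128, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit; the range used here is [-7, 7]
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
        scales.append(scale)
        codes.extend(round(w / scale) for w in group)
    return codes, scales


def dequantize_groupwise(codes, scales, group_size=128):
    """Map integer codes back to floats using each group's scale."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


random.seed(0)
weights = [random.uniform(-1, 1) for _ in range(300)]
codes, scales = quantize_groupwise(weights)
restored = dequantize_groupwise(codes, scales)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(scales)} groups, worst absolute error {worst:.4f}")
```

Each weight is stored as a small integer plus a shared per-group scale, so the rounding error of any weight is bounded by half of its group's scale.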

🛠️ Installation & Setup

Prerequisites

  • Python 3.8+
  • PyTorch with CUDA support (for GPU quantization)
  • High-end GPU with 48GB+ VRAM recommended (A40, A6000, L40, or similar)

Install Dependencies

# Install core packages
pip install --upgrade transformers
pip install auto-round
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For Hugging Face Hub integration
pip install huggingface-hub

Quantization in Jupyter Notebooks

The project provides two ready-to-use notebooks:

  1. auto_round_Qwen_3_4B.ipynb - Quantize the 4B text model
  2. auto_round_Qwen_3_VL_8B.ipynb - Quantize the 8B vision-language model

Run these notebooks to:

  • Load base models from Hugging Face
  • Configure quantization parameters
  • Perform weight tuning (600 iterations)
  • Export to multiple formats (AutoRound, AWQ, GPTQ)
  • Push quantized models to Hugging Face Hub
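The export and upload steps could look roughly like the sketch below. It assumes an already constructed `AutoRound` object exposing `quantize()` and `save_quantized(output_dir, format=..., inplace=...)`, as in recent auto-round releases, plus `HfApi.create_repo` and `HfApi.upload_folder` from `huggingface_hub`. The helper names, the directory layout, and the repository naming scheme are hypothetical; the notebooks may organize this differently.

```python
from pathlib import Path

# README labels mapped to auto-round export format names.
EXPORT_FORMATS = {"AutoRound": "auto_round", "AWQ": "auto_awq", "GPTQ": "auto_gptq"}


def export_plan(base_dir, model_name):
    """Return (format_name, output_dir) pairs, one per export format."""
    return [
        (fmt, str(Path(base_dir) / f"{model_name}-{label}"))
        for label, fmt in EXPORT_FORMATS.items()
    ]


def export_and_push(autoround, base_dir, model_name, hub_user):
    """Hypothetical helper: tune once, then save and upload every format."""
    from huggingface_hub import HfApi  # lazy import; needs a logged-in token

    api = HfApi()
    autoround.quantize()  # run the weight tuning a single time
    for fmt, out_dir in export_plan(base_dir, model_name):
        # inplace=False keeps the tuned model reusable for the next format
        autoround.save_quantized(out_dir, format=fmt, inplace=False)
        repo_id = f"{hub_user}/{Path(out_dir).name}"
        api.create_repo(repo_id, exist_ok=True)
        api.upload_folder(folder_path=out_dir, repo_id=repo_id)
```

Tuning once and saving several times avoids repeating the expensive 600-iteration pass for each export format.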

💡 Usage Examples

Load Quantized Model (AutoRound Format)

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_id")
model = AutoModelForCausalLM.from_pretrained(
    "model_id",
    device_map="auto",
    torch_dtype="auto"
)

# Inference
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)  # limits generated tokens only
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Load AWQ Format (Nvidia GPU Optimized)

# Requires the autoawq package
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized("model_id", fuse_layers=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("model_id")

# Use with vLLM or TGI for production

📈 Performance Metrics

Quantization Accuracy

  • Iterations: 600 (production-grade weight tuning)
  • Calibration Samples: 512 (comprehensive accuracy preservation)
  • Compression Ratio: ~60% size reduction
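  • Memory Savings: Quantized layers need about 4x less memory than their 16-bit originals; the overall reduction is lower (~60%) because some tensors stay in 16-bit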
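A back-of-the-envelope sketch of why the overall reduction lands below the ideal 75%. The parameter counts are hypothetical round numbers, not measurements from this project, and the estimate ignores zero-points, packing overhead, and file metadata.

```python
def quantized_size_gb(quantized_params, other_params, bits=4,
                      group_size=128, scale_bits=16, other_bits=16):
    """Estimate checkpoint size when only some parameters are quantized."""
    # Every group of weights shares one scale, adding scale_bits / group_size per weight.
    bits_per_weight = bits + scale_bits / group_size
    total_bits = quantized_params * bits_per_weight + other_params * other_bits
    return total_bits / 8 / 1e9


def reduction_vs_16bit(quantized_params, other_params, **kwargs):
    """Fraction of the 16-bit checkpoint size that quantization removes."""
    baseline_gb = (quantized_params + other_params) * 16 / 8 / 1e9
    return 1 - quantized_size_gb(quantized_params, other_params, **kwargs) / baseline_gb


# Hypothetical split: 3.6B parameters quantized, 0.4B (e.g. embeddings) left in 16-bit.
size = quantized_size_gb(3.6e9, 0.4e9)
saved = reduction_vs_16bit(3.6e9, 0.4e9)
print(f"estimated size: {size:.2f} GB, reduction: {saved:.1%}")
```

With these assumed numbers the quantized layers cost 4.125 bits per weight, and the 16-bit remainder accounts for a large share of the final size.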

Format Compatibility

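| Format | Framework | Platform | Speed |
|---|---|---|---|
| AutoRound | Auto-Round (Native) | CPU/GPU | High |
| AWQ | vLLM, TGI | Nvidia GPU | Very High |
| GPTQ | vLLM, TGI, Transformers | CPU/GPU | High |

Note: llama.cpp and Ollama load GGUF files rather than AWQ or GPTQ checkpoints.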

🔄 Export Formats

The quantization pipeline supports exporting to multiple formats:

  1. AutoRound Format - Native Intel format for maximum compatibility with auto-round
  2. AWQ Format - Optimized for Nvidia GPUs, best with vLLM and Text Generation Inference
  3. GPTQ Format - Broad compatibility across different inference frameworks

🀝 Integration with Hugging Face Hub

All quantized models are automatically pushed to Hugging Face Hub with full model cards and quantization details.

📚 Documentation

For detailed information about quantization implementations, see the ReadMe/ folder for implementation details and examples.

🔧 Technical Details

AutoRound Algorithm

AutoRound is an advanced weight-only quantization algorithm developed by Intel that:

  • Tunes weight rounding and clipping ranges with signed gradient descent to minimize quantization error
  • Maintains activation precision at 16-bit for better accuracy
  • Supports group-wise quantization for fine-grained control
  • Achieves state-of-the-art accuracy on large language models
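The idea of tuning the rounding can be illustrated with a toy, pure-Python sketch: a learnable offset in [-0.5, 0.5] is added before rounding and updated with the sign of the gradient of the layer's output error, using a straight-through estimator for the rounding step. This is a simplified illustration loosely modeled on the approach, not the library's implementation; real AutoRound also tunes clipping ranges and works block by block on transformer layers. All names below are hypothetical.

```python
import random


def sign(x):
    return (x > 0) - (x < 0)


def fake_quant(w, scale, v, qmin=-8, qmax=7):
    """Quantize with a rounding offset v per weight, then dequantize."""
    return [max(qmin, min(qmax, round(wi / scale + vi))) * scale
            for wi, vi in zip(w, v)]


def output_residuals(x_rows, w, w_hat):
    """Per-sample difference between quantized and original layer outputs."""
    delta = [a - b for a, b in zip(w_hat, w)]
    return [sum(x * d for x, d in zip(row, delta)) for row in x_rows]


def tune_rounding(x_rows, w, iters=200):
    scale = max(abs(wi) for wi in w) / 7 or 1.0
    v = [0.0] * len(w)  # v = 0 is plain round-to-nearest
    best_v, best_err, rtn_err = list(v), None, None
    for step in range(iters):
        res = output_residuals(x_rows, w, fake_quant(w, scale, v))
        err = sum(r * r for r in res)
        if rtn_err is None:
            rtn_err = err  # error of the untuned starting point
        if best_err is None or err < best_err:
            best_v, best_err = list(v), err  # keep the best offsets seen so far
        # Straight-through estimator: treat round() as identity, so d(w_hat)/d(v) = scale.
        grad = [2 * scale * sum(r * row[j] for r, row in zip(res, x_rows))
                for j in range(len(w))]
        lr = (1.0 / iters) * (1 - step / iters)  # linearly decayed step size
        v = [max(-0.5, min(0.5, vj - lr * sign(g))) for vj, g in zip(v, grad)]
    return rtn_err, best_err, best_v


random.seed(0)
w = [random.uniform(-1, 1) for _ in range(16)]
x_rows = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]
rtn_err, best_err, best_v = tune_rounding(x_rows, w)
print(f"round-to-nearest error: {rtn_err:.4f}, best tuned error: {best_err:.4f}")
```

Because the best offsets are tracked from the round-to-nearest starting point, the reported tuned error can never be worse than plain rounding on the calibration inputs.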

Why This Approach?

  • Accuracy: 600 iterations of weight tuning help preserve model quality
  • Speed: Smaller weights reduce memory traffic, which can speed up memory-bound inference
  • Memory: About 4x less memory for the quantized layers than in 16-bit
  • Compatibility: Multiple export formats for different hardware
  • Production-Ready: Extensively calibrated with 512 samples

⚙️ System Requirements

For Quantization

  • GPU with 48GB+ VRAM (A40, A6000, L40, A100)
  • CUDA 11.8+
  • Python 3.8+
  • PyTorch compiled with CUDA support

For Inference

  • CPU or GPU depending on format
  • Minimal VRAM requirements for quantized models

🎓 Learning Resources

📝 License

This project is released under the Apache 2.0 license.

🙏 Acknowledgments

  • Intel for the AutoRound quantization algorithm
  • Hugging Face for the model hosting and transformers library

📞 Support & Contributions

For questions or contributions, review the quantization notebooks for implementation details and the ReadMe folder for documentation.


Last Updated: February 2026

Project Status: Production-Ready ✅
