AutoRound Quantization Toolkit

A compact collection of notebooks and scripts for weight-only quantization using iterative rounding techniques. This repository contains production-focused pipelines to quantize transformer models, export them in common formats, and run inference with minimal memory overhead.

What it does:

Purpose: Reduce model size and VRAM usage while preserving accuracy.
Approach: Iterative weight tuning with group-wise quantization and 16-bit activations.
Outputs: Exportable quantized checkpoints compatible with common inference tools.

Contents:

Notebooks: Ready-to-run Jupyter notebooks for quantization experiments.
Docs: Implementation notes and examples in the ReadMe/ folder.

Quickstart

Requirements: Python 3.8+, PyTorch (CUDA recommended for quantization), Jupyter.
Install (example):

pip install -r requirements.txt
# or
pip install transformers torch auto-round huggingface-hub

Run a notebook: Launch Jupyter and open the provided .ipynb files to reproduce quantization and export flows.

Usage (inference, generic)

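from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<quantized-model-id>")
model = AutoModelForCausalLM.from_pretrained("<quantized-model-id>", device_map="auto")

# Move the inputs to the model's device, then decode the generated IDs back to text
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))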

Export formats: AutoRound-native, AWQ, GPTQ (format availability depends on the notebook and export scripts).

Where to look next:

Implementation notes and model-specific instructions: ReadMe/
Quantization notebooks (in root): files ending with .ipynb

License: Apache-2.0

For detailed tuning parameters, calibration notes, and export steps, see the markdown files in ReadMe/ and the notebooks in the repository.

Last updated: February 2026

AutoRound Quantization

Comprehensive model quantization project using Intel's AutoRound algorithm to create production-ready 4-bit quantized versions of language models. This project demonstrates advanced weight-only quantization techniques optimized for both text and vision-language models.

🎯 Project Overview

This repository contains automated quantization pipelines for various language models, quantized to W4A16 (4-bit weights, 16-bit activations) using Intel's AutoRound algorithm with extensive calibration and tuning for optimal accuracy retention.

📊 Quantization Details

The project produces quantized models in multiple formats optimized for different deployment scenarios.

🚀 Key Features

Advanced Quantization Configuration

TUNING_CONFIG = {
    "group_size": 128,              # Fine-grained quantization control
    "sym": True,                    # Symmetric quantization for better performance
    "iters": 600,                   # High-precision weight tuning
    "nsamples": 512,                # Extensive calibration samples
    "batch_size": 8,                # Optimized for high-end GPUs
    "seqlen": 2048,                 # Long context support
    "low_gpu_mem_usage": False,     # Keep on GPU for speed
    "enable_torch_compile": True,   # JIT compilation acceleration
    "quant_nontext_module": False   # Preserve vision tower accuracy (VLM only)
}

Quantization Scheme: W4A16

4-bit weights: Smaller checkpoints (~60% size reduction overall)
16-bit activations: Activations stay in 16-bit, which preserves accuracy; the memory savings come from the weights
Symmetric quantization: Better hardware support and inference speed
Group size 128: One scale per 128 weights, a common balance between accuracy and storage overhead
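To make the W4A16 scheme above concrete, here is a minimal, dependency-free sketch of symmetric group-wise quantization using plain round-to-nearest. It is an illustration only: the function names are hypothetical, the scale formula (max absolute value divided by 7) is one common choice and an assumption here, and AutoRound itself tunes the rounding rather than using round-to-nearest.

```python
import math
from typing import List, Tuple

Group = Tuple[List[int], float]  # (integer codes, scale) for one group


def quantize_group_sym(weights: List[float], bits: int = 4) -> Group:
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit
    max_abs = max(abs(w) for w in weights)
    # Assumed scale choice: map the largest magnitude in the group to qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row: List[float], group_size: int = 128, bits: int = 4) -> List[Group]:
    """Split a weight row into groups and quantize each with its own scale."""
    return [
        quantize_group_sym(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Map integer codes back to floats, as done before the 16-bit matmul."""
    return [code * scale for codes, scale in groups for code in codes]


# Toy row of 256 weights: two groups of 128, each with its own scale.
row = [math.sin(i) * (1 + i / 256) for i in range(256)]
groups = quantize_row(row)
restored = dequantize_row(groups)
max_error = max(abs(a - b) for a, b in zip(row, restored))
print(len(groups), round(max_error, 4))
```

Because each group has its own scale, an outlier in one group does not coarsen the grid for the others; the per-weight error is bounded by half of that group's scale.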

🛠️ Installation & Setup

Prerequisites

Python 3.8+
PyTorch with CUDA support (for GPU quantization)
High-end GPU with 48GB+ VRAM recommended (A40, A6000, L40, or similar)

Install Dependencies

# Install core packages
pip install --upgrade transformers
pip install auto-round
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# For Hugging Face Hub integration
pip install huggingface-hub

Quantization in Jupyter Notebooks

The project provides two ready-to-use notebooks:

auto_round_Qwen_3_4B.ipynb - Quantize the 4B text model
auto_round_Qwen_3_VL_8B.ipynb - Quantize the 8B vision-language model

Run these notebooks to:

Load base models from Hugging Face
Configure quantization parameters
Perform weight tuning (600 iterations)
Export to multiple formats (AutoRound, AWQ, GPTQ)
Push quantized models to Hugging Face Hub
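The notebook steps above can be sketched roughly as follows for a text model. This is a hedged outline, not the notebooks' actual code: split_config, VLM_ONLY_KEYS, and quantize_text_model are hypothetical helpers, and the auto-round calls (AutoRound, quantize, save_quantized) reflect the library's documented usage at the time of writing, so check them against the installed version.

```python
from typing import Any, Dict

TUNING_CONFIG: Dict[str, Any] = {
    "group_size": 128,
    "sym": True,
    "iters": 600,
    "nsamples": 512,
    "batch_size": 8,
    "seqlen": 2048,
    "low_gpu_mem_usage": False,
    "enable_torch_compile": True,
    "quant_nontext_module": False,  # VLM only
}

# Assumption: keys that only make sense for vision-language models.
VLM_ONLY_KEYS = {"quant_nontext_module"}


def split_config(config: Dict[str, Any], is_vlm: bool) -> Dict[str, Any]:
    """Drop VLM-only keys when quantizing a text-only model."""
    return {k: v for k, v in config.items() if is_vlm or k not in VLM_ONLY_KEYS}


def quantize_text_model(model_id: str, output_dir: str, export_format: str = "auto_round") -> None:
    """Load, tune, and export one text model (requires a GPU stack and auto-round)."""
    # Heavy imports stay inside the function so the helpers above work without them.
    from auto_round import AutoRound
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    autoround = AutoRound(model, tokenizer, bits=4, **split_config(TUNING_CONFIG, is_vlm=False))
    autoround.quantize()
    # Other format names used by auto-round include "auto_awq" and "auto_gptq".
    autoround.save_quantized(output_dir, format=export_format)


text_config = split_config(TUNING_CONFIG, is_vlm=False)
print(sorted(text_config))
```

A call such as quantize_text_model("<base-model-id>", "./quantized") would then run the full tuning pass; with 600 iterations and 512 calibration samples this can take a long time on a single GPU.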

💡 Usage Examples

Load Quantized Model (AutoRound Format)

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("model_id")
model = AutoModelForCausalLM.from_pretrained(
    "model_id",
    device_map="auto",
    torch_dtype="auto"
)

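# Inference: move inputs to the model's device and cap the number of generated tokens
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))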

Load AWQ Format (Nvidia GPU Optimized)

# Requires the AutoAWQ package: pip install autoawq
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model = AutoAWQForCausalLM.from_quantized("model_id", fuse_layers=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("model_id")

# For production serving, load the same checkpoint with vLLM or TGI

📈 Performance Metrics

Quantization Accuracy

Iterations: 600 (production-grade weight tuning)
Calibration Samples: 512 (comprehensive accuracy preservation)
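Compression Ratio: ~60% reduction in overall checkpoint size (some layers, such as embeddings, typically stay in 16-bit)
Memory Savings: Up to ~4x smaller weight memory than 16-bit weights (4 bits vs. 16 bits per weight, before scale overhead)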
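The size figures above follow from simple arithmetic, sketched below. The function names are hypothetical, the parameter counts are made-up round numbers for illustration (not measurements of any model in this repository), and the estimate assumes one 16-bit scale per group with no stored zero point; real packed formats may add more metadata.

```python
def effective_bits_per_weight(bits: int = 4, group_size: int = 128,
                              scale_bits: int = 16, zero_point_bits: int = 0) -> float:
    """Bits stored per quantized weight, including per-group metadata."""
    return bits + (scale_bits + zero_point_bits) / group_size


def estimated_size_gb(n_quantized: float, n_unquantized: float,
                      bits: int = 4, group_size: int = 128) -> float:
    """Rough checkpoint size: quantized linear layers plus layers left in 16-bit."""
    quantized_bytes = n_quantized * effective_bits_per_weight(bits, group_size) / 8
    fp16_bytes = n_unquantized * 2
    return (quantized_bytes + fp16_bytes) / 1e9


# Hypothetical 4B-parameter model: 3.6B weights quantized, 0.4B kept in 16-bit.
baseline_gb = (3.6e9 + 0.4e9) * 2 / 1e9
quantized_gb = estimated_size_gb(3.6e9, 0.4e9)
reduction = 1 - quantized_gb / baseline_gb
print(f"{baseline_gb:.2f} GB -> {quantized_gb:.2f} GB ({reduction:.0%} smaller)")
```

The quantized layers alone shrink by close to 4x, but the layers kept in 16-bit pull the overall reduction down, which is why whole-checkpoint savings land below 75%.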

Format Compatibility

Format	Framework	Platform	Speed
AutoRound	Auto-Round (Native)	CPU/GPU	High
AWQ	vLLM, TGI, Transformers (AutoAWQ)	Nvidia GPU	Very High
GPTQ	vLLM, TGI, Transformers	CPU/GPU	High

🔄 Export Formats

The quantization pipeline supports exporting to multiple formats:

AutoRound Format - Native Intel format for maximum compatibility with auto-round
AWQ Format - Optimized for Nvidia GPUs, best with vLLM and Text Generation Inference
GPTQ Format - Broad compatibility across different inference frameworks

🤝 Integration with Hugging Face Hub

All quantized models are automatically pushed to Hugging Face Hub with full model cards and quantization details.
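A minimal sketch of what such an upload step could look like with the huggingface_hub client is shown below. build_model_card and push_quantized_model are hypothetical helpers and the card layout is an assumption, not the repository's actual template; the upload also assumes you are already authenticated (for example via an HF_TOKEN environment variable).

```python
from typing import Any, Dict


def build_model_card(base_model: str, config: Dict[str, Any]) -> str:
    """Render a small Markdown model card listing the quantization settings."""
    lines = [
        f"# {base_model} (W4A16, AutoRound)",
        "",
        f"Quantized from `{base_model}` with the following settings:",
        "",
    ]
    lines += [f"- `{key}`: {value}" for key, value in sorted(config.items())]
    return "\n".join(lines) + "\n"


def push_quantized_model(local_dir: str, repo_id: str, card: str) -> None:
    """Write the card next to the checkpoint and upload the whole folder."""
    # Imported lazily so the card helper works without huggingface_hub installed.
    from pathlib import Path
    from huggingface_hub import HfApi

    Path(local_dir, "README.md").write_text(card, encoding="utf-8")
    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="model")


card = build_model_card("<base-model-id>", {"group_size": 128, "sym": True, "iters": 600})
print(card)
```

Keeping the card generation separate from the upload makes it easy to review the card text before anything is pushed.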

📚 Documentation

For detailed information about quantization implementations, see the ReadMe/ folder for implementation details and examples.

🔧 Technical Details

AutoRound Algorithm

AutoRound is an advanced weight-only quantization algorithm developed by Intel that:

Uses iterative weight tuning to minimize quantization error
Maintains activation precision at 16-bit for better accuracy
Supports group-wise quantization for fine-grained control
Achieves state-of-the-art accuracy on large language models
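The idea of iterative weight tuning can be illustrated with a toy, dependency-free sketch: a per-weight rounding offset in [-0.5, 0.5] is adjusted by signed gradient steps so that the quantized layer's outputs stay close to the original outputs. This is a simplification under stated assumptions (a single output neuron, one group, a straight-through gradient estimate, a step size of 1/iters, random data in place of calibration samples); the real algorithm works on whole transformer blocks and tunes additional parameters, so treat this as a sketch of the principle rather than AutoRound's implementation.

```python
import random
from typing import List, Tuple


def quantize(w: List[float], v: List[float], scale: float,
             qmin: int = -8, qmax: int = 7) -> List[float]:
    """Quantize-dequantize w, shifting each weight by a rounding offset v."""
    return [scale * min(qmax, max(qmin, round(wi / scale + vi))) for wi, vi in zip(w, v)]


def output_error(w: List[float], wq: List[float], xs: List[List[float]]) -> float:
    """Sum of squared differences between original and quantized outputs."""
    total = 0.0
    for x in xs:
        diff = sum(xi * (qi - wi) for xi, wi, qi in zip(x, w, wq))
        total += diff * diff
    return total


def tune_rounding(w: List[float], xs: List[List[float]],
                  iters: int = 200) -> Tuple[List[float], float, float]:
    """Return (best offsets, round-to-nearest error, best tuned error)."""
    scale = max(abs(wi) for wi in w) / 7
    lr = 1.0 / iters
    v = [0.0] * len(w)                      # v = 0 is plain round-to-nearest
    rtn_err = output_error(w, quantize(w, v, scale), xs)
    best_v, best_err = v[:], rtn_err
    for _ in range(iters):
        wq = quantize(w, v, scale)
        # Straight-through estimate: treat round() as identity when differentiating.
        grad = [0.0] * len(w)
        for x in xs:
            diff = sum(xi * (qi - wi) for xi, wi, qi in zip(x, w, wq))
            for j, xj in enumerate(x):
                grad[j] += 2.0 * diff * xj * scale
        # Signed step: only the gradient's sign is used, then clamp the offset.
        v = [min(0.5, max(-0.5, vj - lr * ((g > 0) - (g < 0)))) for vj, g in zip(v, grad)]
        err = output_error(w, quantize(w, v, scale), xs)
        if err < best_err:
            best_v, best_err = v[:], err
    return best_v, rtn_err, best_err


random.seed(0)
w = [random.gauss(0.0, 1.0) for _ in range(16)]
xs = [[random.gauss(0.0, 1.0) for _ in range(16)] for _ in range(32)]
best_v, rtn_err, tuned_err = tune_rounding(w, xs)
print(f"RTN error: {rtn_err:.4f}  best tuned error: {tuned_err:.4f}")
```

Because the loop keeps the best offsets seen so far and starts from round-to-nearest, the tuned error can never be worse than the round-to-nearest baseline on the samples used for tuning.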

Why This Approach?

Accuracy: 600 iterations of weight tuning preserves model quality
Speed: Reduced model size enables faster inference
Memory: Up to ~4x less weight memory than 16-bit weights at deployment
Compatibility: Multiple export formats for different hardware
Production-Ready: Extensively calibrated with 512 samples

⚙️ System Requirements

For Quantization

GPU with 48GB+ VRAM (A40, A6000, L40, A100)
CUDA 11.8+
Python 3.8+
PyTorch compiled with CUDA support

For Inference

CPU or GPU depending on format
Minimal VRAM requirements for quantized models

🎓 Learning Resources

📝 License

This project is released under Apache 2.0 license.

🙏 Acknowledgments

Intel for the AutoRound quantization algorithm
Hugging Face for the model hosting and transformers library

📞 Support & Contributions

For questions or contributions, review the quantization notebooks for implementation details and the ReadMe folder for documentation.

Last Updated: February 2026

Project Status: Production-Ready ✅
