A compact collection of notebooks and scripts for weight-only quantization using iterative rounding techniques. This repository contains production-focused pipelines to quantize transformer models, export them in common formats, and run inference with minimal memory overhead.
What it does:
- Purpose: Reduce model size and VRAM usage while preserving accuracy.
- Approach: Iterative weight tuning with group-wise quantization and 16-bit activations.
- Outputs: Exportable quantized checkpoints compatible with common inference tools.
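To make the group-wise part of the approach concrete, here is a minimal pure-Python sketch of symmetric 4-bit quantization with one scale per group. It shows only plain round-to-nearest, not the iterative tuning step, and the function names `quantize_groupwise` and `dequantize_groupwise` are illustrative rather than taken from this repository or from auto-round.

```python
import random

def quantize_groupwise(weights, group_size=128, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group."""
    qmax = 2 ** (bits - 1) - 1      # 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # -8 for 4-bit
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group onto the largest integer level.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(qmin, min(qmax, round(w / scale))) for w in group)
    return codes, scales

def dequantize_groupwise(codes, scales, group_size=128):
    """Rebuild approximate weights from integer codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]   # toy weight row
codes, scales = quantize_groupwise(weights)
restored = dequantize_groupwise(codes, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups: {len(scales)}, max abs error: {max_err:.6f}")
```

A smaller group size means more scales to store but a tighter fit to each group's value range, which is the trade-off behind the choice of group size.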
Contents:
- Notebooks: Ready-to-run Jupyter notebooks for quantization experiments.
- Docs: Implementation notes and examples in the ReadMe/ folder.
Quickstart
- Requirements: Python 3.8+, PyTorch (CUDA recommended for quantization), Jupyter.
- Install (example):
pip install -r requirements.txt
# or
pip install transformers torch auto-round huggingface-hub
- Run a notebook: Launch Jupyter and open the provided .ipynb files to reproduce the quantization and export flows.
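Before opening a notebook, it can help to confirm that the packages from the install step are importable. The sketch below uses only the standard library. The `REQUIRED` list is an assumption based on the pip command above, and note that the import names use underscores where the pip names use hyphens.

```python
import importlib.util

# Import names for the packages installed above
# (pip: auto-round -> import auto_round, pip: huggingface-hub -> import huggingface_hub).
REQUIRED = ["torch", "transformers", "auto_round", "huggingface_hub"]

def missing_packages(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("Missing packages:", ", ".join(missing) if missing else "none")
```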
Usage (inference, generic)
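from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("<quantized-model-id>")
model = AutoModelForCausalLM.from_pretrained("<quantized-model-id>", device_map="auto")
# Move inputs to the model's device, then decode the generated token IDs back to text
inputs = tokenizer("Hello world", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))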
Export formats: AutoRound-native, AWQ, GPTQ (format availability depends on the notebook and export scripts).
Where to look next:
- Implementation notes and model-specific instructions: ReadMe/
- Quantization notebooks (in root): files ending with .ipynb
License: Apache-2.0
For detailed tuning parameters, calibration notes, and export steps, see the markdown files in ReadMe/ and the notebooks in the repository.
Last updated: February 2026
Comprehensive model quantization project using Intel's AutoRound algorithm to create production-ready 4-bit quantized versions of language models. This project demonstrates advanced weight-only quantization techniques optimized for both text and vision-language models.
This repository contains automated quantization pipelines for various language models, quantized to W4A16 (4-bit weights, 16-bit activations) using Intel's AutoRound algorithm with extensive calibration and tuning for optimal accuracy retention.
The project produces quantized models in multiple formats optimized for different deployment scenarios.
TUNING_CONFIG = {
    "group_size": 128,              # Fine-grained quantization control
    "sym": True,                    # Symmetric quantization for better performance
    "iters": 600,                   # High-precision weight tuning
    "nsamples": 512,                # Extensive calibration samples
    "batch_size": 8,                # Optimized for high-end GPUs
    "seqlen": 2048,                 # Long context support
    "low_gpu_mem_usage": False,     # Keep on GPU for speed
    "enable_torch_compile": True,   # JIT compilation acceleration
    "quant_nontext_module": False,  # Preserve vision tower accuracy (VLM only)
}
- 4-bit weights: Reduced model size (~60% compression)
- 16-bit activations: Maintained accuracy with reduced memory footprint
- Symmetric quantization: Better hardware support and inference speed
- Group size 128: Optimal balance between accuracy and efficiency
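The size figures above can be checked with back-of-the-envelope arithmetic. The sketch below assumes one 16-bit scale per group of 128 weights and no stored zero point; actual export formats may pack extra metadata. The 80% quantized fraction in the last step is a hypothetical value chosen for illustration, not a measurement from these models.

```python
def effective_bits(bits=4, group_size=128, scale_bits=16, zero_point_bits=0):
    """Average storage per weight, including per-group metadata."""
    return bits + (scale_bits + zero_point_bits) / group_size

def size_reduction(bits_per_weight, baseline_bits=16):
    """Fractional size reduction of quantized layers relative to the baseline."""
    return 1 - bits_per_weight / baseline_bits

def overall_reduction(quantized_fraction, bits_per_weight=None, baseline_bits=16):
    """Reduction for a whole model when only part of the parameters is quantized."""
    if bits_per_weight is None:
        bits_per_weight = effective_bits()
    return quantized_fraction * size_reduction(bits_per_weight, baseline_bits)

bpw = effective_bits()                    # 4 + 16/128 bits per weight
layer_saving = size_reduction(bpw)        # saving on the quantized layers alone
model_saving = overall_reduction(0.8)     # hypothetical: 80% of parameters quantized
print(f"{bpw} bits/weight, layers: {layer_saving:.1%}, model: {model_saving:.1%}")
```

Quantized layers alone shrink by roughly three quarters against 16-bit weights. Parameters kept at 16-bit (for example embeddings or a vision tower) are one plausible reason the whole-model figure lands closer to ~60%.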
- Python 3.8+
- PyTorch with CUDA support (for GPU quantization)
- High-end GPU with 48GB+ VRAM recommended (A40, A6000, L40, or similar)
# Install core packages
pip install --upgrade transformers
pip install auto-round
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For Hugging Face Hub integration
pip install huggingface-hub
The project provides two ready-to-use notebooks:
- auto_round_Qwen_3_4B.ipynb - Quantize the 4B text model
- auto_round_Qwen_3_VL_8B.ipynb - Quantize the 8B vision-language model
Run these notebooks to:
- Load base models from Hugging Face
- Configure quantization parameters
- Perform weight tuning (600 iterations)
- Export to multiple formats (AutoRound, AWQ, GPTQ)
- Push quantized models to Hugging Face Hub
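The notebook steps for a text model can be sketched roughly as follows. This is an outline under assumptions, not the notebooks' actual code: the `AutoRound` constructor arguments and the `quantize_and_save` format strings (`auto_round`, `auto_awq`, `auto_gptq`) follow the auto-round documentation as I understand it and should be checked against the installed version. The helper names `text_model_kwargs` and `quantize_and_export` are hypothetical, and the vision-language path is not shown.

```python
TUNING_CONFIG = {
    "group_size": 128,
    "sym": True,
    "iters": 600,
    "nsamples": 512,
    "batch_size": 8,
    "seqlen": 2048,
    "low_gpu_mem_usage": False,
    "enable_torch_compile": True,
    "quant_nontext_module": False,  # VLM only
}

def text_model_kwargs(config, bits=4):
    """Copy the config for a text-only model, dropping the VLM-only key."""
    kwargs = {"bits": bits, **config}
    kwargs.pop("quant_nontext_module", None)
    return kwargs

def quantize_and_export(model_id, output_dir, config,
                        formats=("auto_round", "auto_awq", "auto_gptq")):
    """Load a base model, tune its weights, and save the requested formats."""
    # Heavy imports stay inside the function so this file imports without a GPU.
    from auto_round import AutoRound
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    rounder = AutoRound(model, tokenizer, **text_model_kwargs(config))
    # Assumed API: a comma-separated list of export formats.
    rounder.quantize_and_save(output_dir, format=",".join(formats))

if __name__ == "__main__":
    # "<base-model-id>" is a placeholder; substitute a real Hugging Face model ID.
    quantize_and_export("<base-model-id>", "./quantized", TUNING_CONFIG)
```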
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("model_id")
model = AutoModelForCausalLM.from_pretrained(
    "model_id",
    device_map="auto",
    torch_dtype="auto"
)
# Inference: move inputs to the model's device before generating
inputs = tokenizer("Your prompt here", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=100)
print(tokenizer.decode(outputs[0]))

# AWQ checkpoint, loaded with the autoawq package
from transformers import AutoTokenizer
from awq import AutoAWQForCausalLM
model = AutoAWQForCausalLM.from_quantized("model_id", fuse_layers=True, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("model_id")
# Use with vLLM or TGI for production
- Iterations: 600 (production-grade weight tuning)
- Calibration Samples: 512 (comprehensive accuracy preservation)
- Compression Ratio: ~60% size reduction
- Memory Savings: up to 4x smaller weights than 16-bit precision (4 bits vs. 16 bits per weight, before per-group metadata)
| Format | Framework | Platform | Speed |
|---|---|---|---|
| AutoRound | Auto-Round (Native) | CPU/GPU | High |
| AWQ | vLLM, TGI, AutoAWQ | Nvidia GPU | Very High |
| GPTQ | vLLM, TGI, Transformers | CPU/GPU | High |
The quantization pipeline supports exporting to multiple formats:
- AutoRound Format - Native Intel format for maximum compatibility with auto-round
- AWQ Format - Optimized for Nvidia GPUs, best with vLLM and Text Generation Inference
- GPTQ Format - Broad compatibility across different inference frameworks
All quantized models are automatically pushed to Hugging Face Hub with full model cards and quantization details.
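A minimal sketch of the upload step, assuming the `huggingface_hub` client and an already configured access token. The card layout and the helper names `render_model_card` and `push_quantized_model` are hypothetical; the notebooks may generate richer model cards.

```python
from pathlib import Path

def render_model_card(base_model, config):
    """Build a minimal model card listing the quantization settings."""
    lines = [f"# {base_model} (W4A16, AutoRound)", "", "Quantization settings:", ""]
    lines += [f"- {key}: {value}" for key, value in sorted(config.items())]
    return "\n".join(lines) + "\n"

def push_quantized_model(local_dir, repo_id, base_model, config):
    """Write the card into the export folder and upload the folder to the Hub."""
    # Imported lazily so the card helper works without huggingface_hub installed.
    from huggingface_hub import HfApi

    Path(local_dir, "README.md").write_text(
        render_model_card(base_model, config), encoding="utf-8"
    )
    api = HfApi()
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(folder_path=local_dir, repo_id=repo_id, repo_type="model")

card = render_model_card("<base-model-id>", {"iters": 600, "group_size": 128})
print(card)
```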
For detailed information about quantization implementations, see the ReadMe/ folder for implementation details and examples.
AutoRound is an advanced weight-only quantization algorithm developed by Intel that:
- Uses iterative weight tuning to minimize quantization error
- Maintains activation precision at 16-bit for better accuracy
- Supports group-wise quantization for fine-grained control
- Achieves state-of-the-art accuracy on large language models
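The idea of iterative weight tuning can be illustrated with a toy example: learn a per-weight rounding offset in [-0.5, 0.5] by signed gradient steps so that a layer's output changes as little as possible. This is a simplified sketch for a single output neuron, with a straight-through gradient estimate and a learning rate of 1/iters assumed. It is not auto-round's implementation and it omits clipping-range tuning.

```python
import random

QMIN, QMAX = -8, 7  # signed 4-bit integer range

def quantize(w, v, scale):
    """Quantize-dequantize w; v holds per-weight rounding offsets."""
    return [max(QMIN, min(QMAX, round(wi / scale + vi))) * scale
            for wi, vi in zip(w, v)]

def output_error(X, w, wq):
    """Squared error between outputs with original and quantized weights."""
    return sum(sum(xj * (qj - wj) for xj, wj, qj in zip(x, w, wq)) ** 2 for x in X)

def tune_rounding(X, w, iters=200):
    """Tune rounding offsets with sign-of-gradient steps; keep the best iterate."""
    scale = max(abs(wi) for wi in w) / QMAX
    lr = 1.0 / iters
    v = [0.0] * len(w)                       # v = 0 is plain round-to-nearest
    best_v, best_err = list(v), output_error(X, w, quantize(w, v, scale))
    for _ in range(iters):
        wq = quantize(w, v, scale)
        residuals = [sum(xj * (qj - wj) for xj, wj, qj in zip(x, w, wq)) for x in X]
        for j in range(len(w)):
            # Straight-through estimate: the gradient w.r.t. v[j] has the sign
            # of sum_n residual_n * x_n[j], so step in the opposite direction.
            grad = sum(r * x[j] for r, x in zip(residuals, X))
            step = -lr if grad > 0 else lr if grad < 0 else 0.0
            v[j] = max(-0.5, min(0.5, v[j] + step))
        err = output_error(X, w, quantize(w, v, scale))
        if err < best_err:
            best_v, best_err = list(v), err
    return best_v, best_err, scale

random.seed(0)
w = [random.gauss(0, 1) for _ in range(16)]                       # toy weights
X = [[random.gauss(0, 1) for _ in range(16)] for _ in range(32)]  # toy calibration inputs
best_v, tuned_err, scale = tune_rounding(X, w)
rtn_err = output_error(X, w, quantize(w, [0.0] * len(w), scale))
print(f"round-to-nearest error: {rtn_err:.4f}, tuned error: {tuned_err:.4f}")
```

Because the search starts from round-to-nearest and keeps the best iterate, the tuned error is never worse than the round-to-nearest error on the calibration inputs.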
- Accuracy: 600 iterations of weight tuning help preserve model quality
- Speed: Reduced model size enables faster inference
- Memory: up to 4x smaller weights than 16-bit precision for deployment
- Compatibility: Multiple export formats for different hardware
- Production-Ready: Extensively calibrated with 512 samples
- GPU with 48GB+ VRAM (A40, A6000, L40, A100)
- CUDA 11.8+
- Python 3.8+
- PyTorch compiled with CUDA support
- CPU or GPU depending on format
- Minimal VRAM requirements for quantized models
This project is released under the Apache 2.0 license.
- Intel for the AutoRound quantization algorithm
- Hugging Face for the model hosting and transformers library
For questions or contributions, review the quantization notebooks for implementation details and the ReadMe folder for documentation.
Last Updated: February 2026
Project Status: Production-Ready