Hi, I'm Hitesh Sahu
AI Infrastructure Engineer • GPU Systems • Cloud Native • Open Source
Portfolio: https://hiteshsahu.com
I build the infrastructure behind modern AI systems.
My work focuses on GPU clusters, Slurm, Kubernetes, distributed systems, observability, and developer tooling that makes AI workloads easier to run, benchmark, and debug.
I'm currently building an open-source ecosystem for AI infrastructure, including local Slurm clusters, GPU observability, HPC developer tools, and LLM benchmarking.
These projects complement each other, and each one solves a piece of the AI workflow puzzle, from training and inference to deployment and monitoring.

| Project | Description | Tech |
|---|---|---|
| Model Gym | A fitness center for AI models. Import, export, optimize, benchmark, and report LLM inference performance across engines, runtimes, and hardware platforms. | Next.js • TypeScript • Python • PyTorch • Hugging Face |
| RAG Factory | Transforms chaotic PDFs, documents, websites, databases, and APIs into trusted answers using embeddings, retrieval, reranking, and large language models. | Python • FastAPI • LangChain • Vector Databases • OpenAI • Ollama |
| NVIDIA SuperPod | GPU Infrastructure Lab for building an AI supercomputer from commodity GPU servers. Explore multi-node training, networking, storage, scheduling, observability, and large-scale AI infrastructure. | Go • Kubernetes • Slurm • NVIDIA GPUs • InfiniBand • Prometheus |
| GPU Lens | Drop-in GPU + scheduler observability for clusters you already have. Get instant visibility into GPU health, utilization, memory, temperatures, ECC errors, XID faults, scheduler activity, and queue health. | Go • Prometheus • Grafana • DCGM Exporter • Kubernetes • Slurm |
| Squint | A GPU-aware Slurm monitor for your terminal. Read-only, zero-config, and runs anywhere. Visualize jobs, nodes, GPUs, queue health, and pending reasons through a fast terminal UI. | Go • Bubble Tea • Lip Gloss • Slurm • TUI |
| Caravan | Spin up a complete local Slurm cluster with a single command. Develop, test, and submit HPC and AI workloads on Docker or Podman with GPU scheduling, making local experimentation fast and reproducible. | Go • Slurm • Docker • Podman • Cobra • HPC |
| GPU-Fabric-Bench | Reproducible RDMA fabric benchmarking suite for NCCL GPU collective communications on AWS EFA. Maps InfiniBand concepts to cloud-native HPC networking and visualizes latency, bandwidth, topology, and scaling behavior. | NCCL • AWS EFA • RDMA • MPI • NVIDIA GPUs • Python |

hiteshsahu.com • Stack Overflow (42k+)