hiteshsahu/README.md

👋 Hi, I'm Hitesh Sahu

🚀 AI Infrastructure Engineer • GPU Systems • Cloud Native • Open Source

🌍 Portfolio: https://hiteshsahu.com

I build the infrastructure behind modern AI systems.

My work focuses on GPU clusters, Slurm, Kubernetes, distributed systems, observability, and developer tooling that makes AI workloads easier to run, benchmark, and debug.

I'm currently building an open-source ecosystem for AI infrastructure, including local Slurm clusters, GPU observability, HPC developer tools, and LLM benchmarking.


🏆 Certifications

🎓 View all on Credly →

AI Infrastructure Ecosystem

These projects complement each other. Each one solves a piece of the AI workflow, from training and inference to deployment and monitoring.

| Project | Description | Tech |
| --- | --- | --- |
| 🏋️ Model Gym | A fitness center for AI models. Import, export, optimize, benchmark, and report LLM inference performance across engines, runtimes, and hardware platforms. | Next.js • TypeScript • Python • PyTorch • Hugging Face |
| 🦆 RAG Factory | Transforms chaotic PDFs, documents, websites, databases, and APIs into trusted answers using embeddings, retrieval, reranking, and large language models. | Python • FastAPI • LangChain • Vector Databases • OpenAI • Ollama |
| 🐸 NVIDIA SuperPod | GPU Infrastructure Lab for building an AI supercomputer from commodity GPU servers. Explore multi-node training, networking, storage, scheduling, observability, and large-scale AI infrastructure. | Go • Kubernetes • Slurm • NVIDIA GPUs • InfiniBand • Prometheus |
| ֎ GPU Lens | Drop-in GPU + scheduler observability for clusters you already have. Get instant visibility into GPU health, utilization, memory, temperatures, ECC errors, XID faults, scheduler activity, and queue health. | Go • Prometheus • Grafana • DCGM Exporter • Kubernetes • Slurm |
| 🦝🐾 Squint | A GPU-aware Slurm monitor for your terminal. Read-only, zero-config, and runs anywhere. Visualize jobs, nodes, GPUs, queue health, and pending reasons through a fast terminal UI. | Go • Bubble Tea • Lip Gloss • Slurm • TUI |
| 🐪 Caravan | Spin up a complete local Slurm cluster with a single command. Develop, test, and submit HPC and AI workloads on Docker or Podman with GPU scheduling, making local experimentation fast and reproducible. | Go • Slurm • Docker • Podman • Cobra • HPC |
| 🛜 GPU-Fabric-Bench | Reproducible RDMA fabric benchmarking suite for NCCL GPU collective communications on AWS EFA. Maps InfiniBand concepts to cloud-native HPC networking and visualizes latency, bandwidth, topology, and scaling behavior. | NCCL • AWS EFA • RDMA • MPI • NVIDIA GPUs • Python |


⚙️ Tech Stack

🤖 AI & Machine Learning

PyTorch • Hugging Face • LangChain • Ollama • OpenAI

⚡ GPU Computing & HPC

CUDA • NVIDIA • Slurm • NCCL • RDMA • InfiniBand

🛠 Backend & Distributed Systems

Java • Spring Boot • Quarkus • FastAPI • Kafka • PostgreSQL

☸️ Cloud Native

Go • Kubernetes • Helm • Terraform • Docker • Podman • AWS

📊 Observability

Prometheus • Grafana • OpenTelemetry • DCGM

🎨 Frontend

React • Next.js • TypeScript


GitHub LinkedIn Email X

🌍 hiteshsahu.com β€’ πŸ“š Stack Overflow (42k+)


Pinned

  1. ECommerce-App-Android Public

    E-Commerce App for Android with Material Design Pattern

    Java 598 479

  2. GPU-Fabric-Bench Public

    A reproducible benchmark suite for NCCL GPU collective communications over RDMA fabric, targeting AI/HPC workloads.

    Python 2

  3. Nvidia-Super-Pod Public

    Custom self-hosted AWS GPU cluster with Ansible and Kubernetes for ML workloads

    HCL 1

  4. SeeFood-App-master Public

    SeeFood app inspired by the Silicon Valley TV series

    C++ 1

  5. GPU-Lens Public

    Drop-in GPU + scheduler observability for clusters you already have.

    Shell 1

  6. squint Public

    A GPU-aware TUI for Slurm.

    Go 1