
AI in Multiple GPUs: ZeRO FSDP


โœ๏ธ
main character energy ๐Ÿ’ซ
Friday, March 6, 2026 ๐Ÿ“– 2 min read
Image: Towards Data Science

What's Happening

This article explains how the Zero Redundancy Optimizer (ZeRO) works, how to implement it from scratch, and how to use it in PyTorch.

This article is part of a series about distributed AI across multiple GPUs:

Part 1: Understanding the Host and Device Paradigm
Part 2: Point-to-Point and Collective Operations
Part 3: How GPUs Communicate
Part 4: Gradient Accumulation & Distributed Data Parallelism (DDP)
Part 5: ZeRO (this article)
Part 6: Tensor Parallelism (coming soon)

Introduction

In the previous post, we saw how Distributed Data Parallelism (DDP) speeds up training across GPUs. DDP solves the throughput problem, but it introduces a new challenge: memory redundancy.

In vanilla DDP, every GPU holds a complete copy of the model parameters, gradients, and optimizer states.

The Details

For large models like GPT-3 (175B parameters), this redundancy wastes an enormous amount of precious VRAM.

Image by author: model, gradients, and optimizer states are redundant across GPUs in regular DDP

ZeRO (Zero Redundancy Optimizer) solves this.

ZeRO has three levels:

ZeRO-1 partitions only optimizer states
ZeRO-2 partitions optimizer states + gradients
ZeRO-3 partitions optimizer states + gradients + model parameters

ZeRO isn't a parallelism technique, because all GPUs still run the same forward and backward passes. It's a memory optimization strategy that eliminates redundancy across GPUs, letting you train larger models on the same hardware.
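The partitioning idea behind ZeRO-1 can be illustrated in plain Python, with no real GPUs or communication library. This is a toy simulation, not the DeepSpeed or PyTorch implementation: all names (`shard_bounds`, `WORLD_SIZE`, the SGD update standing in for Adam) are illustrative, and the final concatenation stands in for an all-gather.

```python
# Toy sketch of the ZeRO-1 idea: every "rank" keeps the full parameters
# and gradients, but owns only a 1/world_size shard of the optimizer
# state. Each rank updates the parameters it owns, then the updated
# shards are gathered so every rank sees the full model again.

WORLD_SIZE = 4
N_PARAMS = 8  # toy model: 8 scalar parameters

def shard_bounds(rank, n, world_size):
    """Contiguous slice of parameter indices owned by `rank`."""
    per_rank = n // world_size
    return rank * per_rank, (rank + 1) * per_rank

# Replicated on every rank (exactly as in DDP):
params = [1.0] * N_PARAMS
grads = [0.5] * N_PARAMS  # pretend these came from backward + all-reduce

# Partitioned (ZeRO-1): each rank stores optimizer state only for its shard.
optim_shards = {
    rank: {"m": [0.0] * (N_PARAMS // WORLD_SIZE),
           "v": [0.0] * (N_PARAMS // WORLD_SIZE)}
    for rank in range(WORLD_SIZE)
}

lr = 0.1
# Each rank updates only the parameters it owns (plain SGD for brevity).
updated_shards = {}
for rank in range(WORLD_SIZE):
    lo, hi = shard_bounds(rank, N_PARAMS, WORLD_SIZE)
    updated_shards[rank] = [params[i] - lr * grads[i] for i in range(lo, hi)]

# "All-gather": concatenate the shards so every rank has the full params.
params = [p for rank in range(WORLD_SIZE) for p in updated_shards[rank]]
print(params)
```

Note that the 4x saving here applies only to the optimizer state dictionaries; parameters and gradients are still fully replicated, which is exactly what ZeRO-2 and ZeRO-3 go on to address.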

Why This Matters

The Memory Problem in DDP

Let's break down what actually consumes memory during training. For a model with N parameters:

Model parameters: N values (the weights of your neural network)
Gradients: N values (one gradient per parameter)
Optimizer states (Adam): 2N values (a first-moment and a second-moment estimate for each parameter)
Activations: intermediate outputs stored during the forward pass for use in the backward pass

The first three grow with model size and are redundant across GPUs in DDP. Activations grow with batch size, sequence length, and the number of neurons, and are unique per GPU since each GPU processes different data.
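To make the breakdown concrete, here is a back-of-the-envelope calculator for the parameter/gradient/optimizer memory per GPU under each ZeRO stage. It follows the mixed-precision accounting popularized by the ZeRO paper (2 bytes per fp16 parameter, 2 per fp16 gradient, and 12 per parameter of fp32 optimizer state for Adam); the function name and the example model size are illustrative, and activations are deliberately excluded since ZeRO does not shard them.

```python
# Approximate per-GPU memory for mixed-precision Adam training, using the
# ZeRO paper's accounting:
#   fp16 params: 2 bytes/param, fp16 grads: 2 bytes/param,
#   optimizer states (fp32 master copy + momentum + variance): 12 bytes/param.
# Activation memory is excluded: ZeRO does not touch it.

def per_gpu_gib(n_params, world_size, stage):
    """Parameter + gradient + optimizer memory per GPU, in GiB."""
    p, g, o = 2 * n_params, 2 * n_params, 12 * n_params  # bytes
    if stage == 0:       # plain DDP: everything replicated
        total = p + g + o
    elif stage == 1:     # ZeRO-1: shard optimizer states
        total = p + g + o / world_size
    elif stage == 2:     # ZeRO-2: shard gradients + optimizer states
        total = p + (g + o) / world_size
    else:                # ZeRO-3: shard everything
        total = (p + g + o) / world_size
    return total / 2**30

# Example: a 7.5B-parameter model on 8 GPUs.
for s in range(4):
    print(f"ZeRO-{s}: {per_gpu_gib(7.5e9, 8, s):.1f} GiB per GPU")
```

At stage 3 the per-GPU footprint of these three components shrinks by a full factor of the world size, which is why ZeRO-3 lets the same hardware hold much larger models.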


The Bottom Line

ZeRO eliminates the redundant copies of parameters, gradients, and optimizer states, but it doesn't touch activation memory: activations grow with batch size, sequence length, and the number of neurons, and remain unique per GPU since each GPU processes different data.
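In PyTorch, ZeRO-style sharding is exposed through FSDP (Fully Sharded Data Parallel). The sketch below assumes a multi-GPU environment launched with torchrun (so the process-group environment variables are set); the toy model and hyperparameters are placeholders, not a recommended configuration.

```python
# Sketch: ZeRO-2 / ZeRO-3 style sharding via PyTorch FSDP.
# Assumes a torchrun launch on a machine with CUDA GPUs; the model,
# learning rate, and batch here are placeholders.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import ShardingStrategy

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

# ShardingStrategy maps roughly onto the ZeRO stages:
#   SHARD_GRAD_OP -> shard gradients + optimizer states (ZeRO-2-like)
#   FULL_SHARD    -> also shard the parameters themselves (ZeRO-3-like)
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(8, 1024, device="cuda")
loss = model(x).square().mean()
loss.backward()   # gradients are reduce-scattered into each rank's shard
optimizer.step()  # each rank updates only its shard's optimizer states
```

Because this fragment only runs under a distributed launcher (e.g. `torchrun --nproc_per_node=8 train.py`), it is shown as a configuration sketch rather than a standalone script.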


Originally reported by Towards Data Science
