International Symposium on Computer Architecture (ISCA), 2024

MAD-Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems

Samuel Hsia, Alicia Golden, Bilge Acun, Newsha Ardalani, Zach DeVito, Gu-Yeon Wei, David Brooks, Carole-Jean Wu

Foundation Models · Large Language Models · ML Systems · Popular and Landmark Papers

Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructure, and incurs high operational costs. Our analysis, grounded in real-world large-model training on datacenter-scale infrastructure, reveals that 14–32% of all GPU hours are spent on communication with no overlapping computation. To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max. The framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities. By applying MAD-Max to a suite of real-world large-scale ML models on state-of-the-art GPU clusters, we showcase potential throughput improvements of up to 2.24× for pretraining and up to 5.27× for inference.
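The exposed-communication figure invites a back-of-the-envelope illustration. The sketch below is not the authors' MAD-Max implementation; it is a minimal analytical step-time model, with made-up timings (`compute_s`, `overlap_comm_s`, and `exposed_comm_s` are hypothetical), showing how communication that cannot be overlapped with computation inflates per-step wall-clock time and bounds the achievable speedup.

```python
# Hypothetical sketch (not the MAD-Max framework itself): a minimal analytical
# model of one training step, where communication hidden behind computation is
# free up to the compute time, while exposed communication adds directly.

from dataclasses import dataclass

@dataclass
class StepProfile:
    compute_s: float       # pure computation time per step (seconds)
    overlap_comm_s: float  # communication that can be hidden behind compute
    exposed_comm_s: float  # communication with no overlapping computation

    def step_time(self) -> float:
        # Hidden communication is bounded by compute; exposed time adds on top.
        return max(self.compute_s, self.overlap_comm_s) + self.exposed_comm_s

    def exposed_fraction(self) -> float:
        # Share of wall-clock time spent on non-overlapped communication.
        return self.exposed_comm_s / self.step_time()

# Illustrative numbers chosen to land inside the paper's reported 14-32% range.
profile = StepProfile(compute_s=0.70, overlap_comm_s=0.50, exposed_comm_s=0.22)
print(f"step time:        {profile.step_time():.2f} s")      # 0.92 s
print(f"exposed fraction: {profile.exposed_fraction():.1%}")  # ~23.9%

# Upper bound on speedup if all communication could be overlapped with compute:
ideal = max(profile.compute_s, profile.overlap_comm_s + profile.exposed_comm_s)
print(f"ideal speedup:    {profile.step_time() / ideal:.2f}x")  # ~1.28x
```

With these illustrative numbers, roughly 24% of the step is exposed communication, and fully hiding it would recover about a 1.28× speedup. That is the kind of headroom a parallelization-strategy search like MAD-Max's aims to exploit.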

19 citations (1 influential)
