International Conference for High Performance Computing, Networking, Storage and Analysis / 2021

Clairvoyant Prefetching for Distributed Machine Learning I/O

Roman Böhringer, Nikoli Dryden, Tal Ben-Nun, T. Hoefler

Computer Vision, Large Language Models, ML Systems, Popular and Landmark Papers

I/O is emerging as a major bottleneck for machine learning training, especially in distributed environments. Indeed, at large scale, I/O takes as much as 85% of training time. Addressing this I/O bottleneck necessitates careful optimization, as optimal data ingestion pipelines differ between systems and require a delicate balance between access to local storage, external filesystems, and remote nodes. We introduce NoPFS, a machine learning I/O middleware, which provides a scalable, flexible, and easy-to-use solution to the I/O bottleneck. NoPFS uses clairvoyance: given the seed generating the random access pattern for training with SGD, it can exactly predict when and where a sample will be accessed. We combine this with an analysis of access patterns and a performance model to provide distributed caching policies that adapt to different datasets and storage hierarchies. NoPFS reduces I/O times and improves end-to-end training by up to 5.4× on the ImageNet-1k, ImageNet-22k, and CosmoFlow datasets.
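The clairvoyance idea in the abstract can be illustrated with a minimal Python sketch. It is not NoPFS's API or its actual caching policy. It assumes a hypothetical shuffling scheme (one seeded permutation per epoch, split round-robin across workers) and a simple greedy placement that puts a worker's most frequently accessed samples into its fastest storage tier. The function names `access_schedule`, `access_counts`, and `plan_cache`, and the tier names in the usage lines, are invented for illustration.

```python
import random
from collections import Counter


def access_schedule(seed, num_samples, num_workers, num_epochs):
    """Return schedule[epoch][worker] -> list of sample ids in access order.

    Assumption (not from the paper): each epoch's order is a shuffle drawn
    from a PRNG seeded with "<seed>:<epoch>", and worker w reads every
    num_workers-th entry of that order, starting at position w.
    """
    schedule = []
    for epoch in range(num_epochs):
        rng = random.Random(f"{seed}:{epoch}")  # same seed -> same order
        order = list(range(num_samples))
        rng.shuffle(order)
        schedule.append([order[w::num_workers] for w in range(num_workers)])
    return schedule


def access_counts(schedule, worker):
    """Count how often one worker reads each sample over the whole run."""
    counts = Counter()
    for epoch_plan in schedule:
        counts.update(epoch_plan[worker])
    return counts


def plan_cache(counts, tier_capacities):
    """Greedy placement: most frequently read samples go to the fastest tier.

    tier_capacities is a list of (tier_name, capacity_in_samples),
    ordered from fastest to slowest.
    """
    ranked = [sample for sample, _ in counts.most_common()]
    plan, start = {}, 0
    for name, capacity in tier_capacities:
        plan[name] = ranked[start:start + capacity]
        start += capacity
    return plan


# Every worker can compute the full schedule up front, before training starts.
schedule = access_schedule(seed=42, num_samples=12, num_workers=3, num_epochs=4)
counts = access_counts(schedule, worker=0)
print(plan_cache(counts, [("ram", 2), ("ssd", 3)]))
```

Because the schedule depends only on the seed and the run configuration, each worker can derive it independently, with no communication, and then decide which samples to prefetch and where to cache them.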

72 citations, 8 influential

Preview prompt

You are my AI/ML research paper instructor. I want to deeply understand the paper below.

First, teach it in layers:
1. One-paragraph intuition.
2. Problem statement and why it mattered.
3. Key method, architecture, or algorithm.
4. Important equations or mechanisms, explained intuitively.
5. Experiments and evidence.
6. Limitations, assumptions, and failure modes.
7. How this paper influenced later AI/ML/Deep Learning/GenAI work.
8. A 30-minute study plan with checkpoints.
9. Quiz me with 5 questions and wait for my answers.

When something is not available in the attached context, say what is missing and infer carefully.

### Paper attached as context
Title: Clairvoyant Prefetching for Distributed Machine Learning I/O
Authors: Roman Böhringer, Nikoli Dryden, Tal Ben-Nun, T. Hoefler
Year: 2021
Venue: International Conference for High Performance Computing, Networking, Storage and Analysis
Categories: Computer Vision, Large Language Models, ML Systems, Popular and Landmark Papers
Citations: 72
Paper URL: https://arxiv.org/abs/2101.08734
Open PDF: https://arxiv.org/pdf/2101.08734

Learning resources:
- PDF: arXiv PDF (https://arxiv.org/pdf/2101.08734)
- arXiv: arXiv abstract page (https://arxiv.org/abs/2101.08734)
- Google Scholar: Google Scholar references (https://scholar.google.com/scholar?q=Clairvoyant%20Prefetching%20for%20Distributed%20Machine%20Learning%20I%2FO)
- Papers with Code: Papers with Code search (https://paperswithcode.com/search?q=Clairvoyant%20Prefetching%20for%20Distributed%20Machine%20Learning%20I%2FO)
- Semantic Scholar: Semantic Scholar paper page (https://www.semanticscholar.org/paper/c4547ea56633ce91599ae5880163c6c276d6529c)
- YouTube: YouTube explanations (https://www.youtube.com/results?search_query=Clairvoyant%20Prefetching%20for%20Distributed%20Machine%20Learning%20I%2FO+paper+explained)