With the rapid advances in AI in recent years, GPUs have emerged as a popular choice for training deep learning (DL) models on large datasets. To keep up with ever-growing datasets and models, it is also common to distribute training across multiple GPUs in parallel. Achieving cost-effectiveness and high performance in these clusters relies on efficiently sharing resources between multiple users. Unfortunately, most GPU clusters in production rely on resource managers designed for traditional big data analytics. This results in suboptimal performance and strong, but unnecessary, constraints. Tiresias is our first attempt at designing a GPU cluster resource manager that relies on profiling to make good scheduling and placement decisions with little or no input from users.
Distributed training of deep learning (DL) models on GPU clusters is becoming increasingly popular. Existing cluster managers face unique challenges from DL training jobs, such as unpredictable training times, an all-or-nothing execution model, and inflexibility in GPU sharing. Our analysis of a large production GPU cluster shows that existing big data schedulers, coupled with a consolidated job placement constraint (whereby GPUs for the same job must be allocated on as few machines as possible), cause long queueing delays and low overall performance.
We present Tiresias, a GPU cluster resource manager tailored for distributed DL training jobs, which efficiently schedules and places DL jobs to reduce their job completion times (JCT). Given that a DL job's execution time is often unpredictable, we propose two scheduling algorithms that aim to minimize the average JCT: Discretized Two-Dimensional Gittins Index, which relies on partial information, and Discretized Two-Dimensional LAS, which is information-agnostic. Additionally, we describe when the consolidated placement constraint can be relaxed and present a placement algorithm that leverages these observations without any user input. Experiments on a cluster with 60 P100 GPUs, along with large-scale trace-driven simulations, show that Tiresias improves the average JCT by up to 5.5X over an Apache YARN-based resource manager used in production. More importantly, Tiresias's performance is comparable to that of solutions that assume perfect knowledge.
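To give a flavor of the information-agnostic algorithm, here is a simplified Python sketch of the core idea behind Discretized Two-Dimensional LAS. A job's priority is its attained service measured in GPU-time (number of GPUs × executed time), discretized into a small number of priority queues so that jobs are not continuously preempted; within a queue, jobs run in FIFO order. The class names and threshold values below are made up for illustration; the system described in the paper additionally handles preemption, placement, and starvation avoidance.

```python
from dataclasses import dataclass

# Illustrative job record; field names are hypothetical, not from the paper's code.
# eq=False so jobs are compared by identity, not by field values.
@dataclass(eq=False)
class Job:
    job_id: int
    num_gpus: int          # GPUs the job occupies while running
    attained: float = 0.0  # attained service so far, in GPU-seconds
    arrival: float = 0.0   # arrival time, used for FIFO order within a queue

class Discretized2DLAS:
    """A minimal sketch of Discretized Two-Dimensional LAS.

    The "two-dimensional" attained service of a job is num_gpus * executed_time.
    Jobs start in the highest-priority queue and are demoted to lower queues as
    their attained service crosses the thresholds; within a queue, jobs are
    served FIFO. The threshold values here are made up for illustration.
    """
    def __init__(self, thresholds=(3_600.0, 36_000.0)):
        # K thresholds create K+1 discrete priority queues (queue 0 is highest).
        self.thresholds = sorted(thresholds)
        self.queues = [[] for _ in range(len(self.thresholds) + 1)]

    def _queue_index(self, job):
        for i, t in enumerate(self.thresholds):
            if job.attained < t:
                return i
        return len(self.thresholds)

    def add(self, job):
        self.queues[self._queue_index(job)].append(job)

    def account(self, job, seconds):
        """Charge `seconds` of execution as GPU-time; demote on threshold crossing."""
        old = self._queue_index(job)
        job.attained += job.num_gpus * seconds
        new = self._queue_index(job)
        if new != old:
            self.queues[old].remove(job)
            self.queues[new].append(job)

    def pick_next(self):
        """Return the next job to run: FIFO within the highest non-empty queue."""
        for q in self.queues:
            if q:
                return min(q, key=lambda j: j.arrival)
        return None
```

Roughly speaking, the partial-information variant keeps the same discretized queue structure but ranks jobs using a Gittins index computed from the (partially known) distribution of job GPU-time; see the paper for the details of both algorithms.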
This is Juncheng’s second NSDI paper after Infiniswap in NSDI’17, and a very proud moment for me as his advisor. I would like to thank all our collaborators. I would also like to thank Samir and Barna for inviting me to the TTIC Summer Workshop on Datacenter Scheduling, where I heard Mor’s talk on SOAP that inspired us to apply Gittins index-based scheduling to this context in the partial-information case. The application of LAS was inspired by my earlier work on information-agnostic coflow scheduling.
This year, the NSDI PC accepted 49 out of 332 submissions across the Spring (19/92) and Fall (30/240) deadlines, for a somewhat lower acceptance rate than in recent years.