While GPUs are always in the news when it comes to deep learning clusters (e.g., Salus or Tiresias), many other computing devices are emerging alongside them (e.g., FPGAs and problem-specific accelerators), and traditional CPUs remain part of the mix. All of them are compute devices, but one cannot expect the same speed from each of them for every type of computation. A natural question, therefore, is how to allocate such interchangeable resources in a hybrid cluster. AlloX is our attempt at a reasonable answer.
Modern deep learning frameworks support a variety of hardware, including CPU, GPU, and other accelerators, to perform computation. In this paper, we study how to schedule jobs over such interchangeable resources – each with a different rate of computation – to optimize performance while providing fairness among users in a shared cluster. We demonstrate theoretically and empirically that existing solutions and their straightforward modifications perform poorly in the presence of interchangeable resources, which motivates the design and implementation of AlloX. At its core, AlloX transforms the scheduling problem into a min-cost bipartite matching problem and provides dynamic fair allocation over time. We theoretically prove its optimality in an ideal, offline setting and show empirically that it works well in the online scenario by integrating it with Kubernetes. Evaluations on a small-scale CPU-GPU hybrid cluster and large-scale simulations highlight that AlloX can reduce the average job completion time significantly (by up to 95% when the system load is high) while providing fairness and preventing starvation.
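To give a feel for the matching idea, here is a minimal sketch of the classic textbook reduction from "minimize total completion time on machines with different speeds" to min-cost bipartite matching. It is not AlloX's implementation, and it ignores fairness and online arrivals entirely. Each job is matched to a slot (machine, k), meaning "run k-th from last on that machine", at cost k times the job's processing time there, because a job in that slot delays itself and the k - 1 jobs after it. The job names, the processing times, and the function name are all made up for illustration, and the matching is found by exhaustive search to keep the code short; a real scheduler would use a polynomial-time solver such as the Hungarian algorithm.

```python
from itertools import permutations

# Hypothetical processing times (seconds) of each job on each device type.
proc_time = {
    "A": {"cpu": 10, "gpu": 2},
    "B": {"cpu": 4, "gpu": 3},
    "C": {"cpu": 6, "gpu": 1},
}


def min_total_completion_schedule(proc_time):
    """Return (total completion time, per-machine run order).

    Assumes every job lists a processing time for every machine.
    Exhaustive search: only suitable for a handful of jobs.
    """
    jobs = sorted(proc_time)
    machines = sorted({m for times in proc_time.values() for m in times})
    n = len(jobs)

    # Slot (m, k): the k-th position counted from the END of machine m's queue.
    slots = [(m, k) for m in machines for k in range(1, n + 1)]

    best_cost, best_match = None, None
    # Try every way of giving each job its own distinct slot.
    for chosen in permutations(slots, n):
        # A job in slot (m, k) adds its runtime to k completion times.
        cost = sum(k * proc_time[j][m] for j, (m, k) in zip(jobs, chosen))
        if best_cost is None or cost < best_cost:
            best_cost, best_match = cost, chosen

    # Turn the matching into run orders: larger k means earlier in the queue.
    order = {m: [] for m in machines}
    by_position = sorted(zip(jobs, best_match), key=lambda pair: -pair[1][1])
    for job, (machine, _k) in by_position:
        order[machine].append(job)
    return best_cost, order


total, order = min_total_completion_schedule(proc_time)
print(total, order)
```

In this toy instance the best plan puts job B on the CPU even though B runs faster on the GPU, because doing so keeps the GPU queue short for A and C. That is the kind of trade-off a scheduler has to make once resources are interchangeable.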
AlloX has been in the oven for more than two years, and it is a testament to Tan and Xiao’s tenacity. It’s also my second paper with Zhenhua after HUG, the first joint work published from our recent NSF project, and my first paper at EuroSys. I believe the results in this paper will give rise to new analyses of many systems that have interchangeable resources, such as DRAM-NVM hybrid systems or the storage/caching hierarchy.
This year the EuroSys PC accepted 43 out of 234 submissions.