Earlier this week, I gave a keynote on the state of resource management for deep learning at the HotEdgeVideo’2020 workshop, covering our recent work on systems support for AI (Tiresias, AlloX, and Salus) and discussing open challenges in this space.
In this talk, I highlighted three aspects of resource management in the context of deep learning that, I believe, make this setting unique even after decades of resource management research in CPUs, networks, and Big Data clusters.
- Short-term predictability and long-term uncertainty of deep learning workloads: Deep learning workloads (hyperparameter tuning, training, and inference) may have different objectives, but they all share a common characteristic. Although we cannot know how many iterations there will be in a job or how many requests will arrive at a model serving inference, each iteration/request performs the same computation for a given job. This means we can profile a job once and then exploit that information for long-term benefit by adapting classic information-agnostic and information-limited scheduling techniques (see the first sketch after this list).
- Heterogeneous, interchangeable compute devices in deep learning clusters: Deep learning clusters are becoming increasingly diverse, with many generations of GPUs and new hardware accelerators coming out every month. The key resource management challenge here is that all these compute devices are interchangeable (they all can compute), but they do not compute at the same rate for all models: some models are better suited to CPUs, some to GPUs, some to TPUs, and so on. We need to rethink resource management algorithms to account for this resource interchangeability (see the second sketch below).
- Black-box hardware accelerators: Deep learning hardware is also a black box. Even for GPUs, we have no control over the internals; apart from some high-level information that is publicly available, we do not know what happens inside. For newer, vendor-locked accelerators, details are even scarcer. Consequently, resource management solutions should be designed to assume black-box hardware from the get-go and then rely on profiling (leveraging the iterative nature of deep learning) and short-term predictability to extract good performance using software techniques (see the third sketch below).
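To make the first point concrete, here is a minimal Python sketch of the profile-then-exploit loop. The `run_iteration` callable and the job fields are hypothetical, and the least-attained-service ordering is only loosely in the spirit of Tiresias; treat it as an illustration, not an implementation.

```python
import time

def profile_iteration_time(run_iteration, warmup=3, samples=10):
    """Time a few iterations and return the steady-state mean.

    Because each iteration of a deep learning job performs the same
    computation, a short profile predicts all future iterations.
    `run_iteration` is a hypothetical callable that runs one iteration.
    """
    for _ in range(warmup):              # discard warm-up effects (caches, JIT, etc.)
        run_iteration()
    start = time.perf_counter()
    for _ in range(samples):
        run_iteration()
    return (time.perf_counter() - start) / samples

def attained_service(job):
    """GPU-time a job has received so far: iterations done x time/iter x GPUs."""
    return job["iters_done"] * job["iter_time"] * job["num_gpus"]

def pick_next(queued_jobs):
    """Least-attained-service ordering: favor the job that has run the least,
    without needing to know how many iterations remain."""
    return min(queued_jobs, key=attained_service)
```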
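For the second point, the core decision is which of several interchangeable devices, each running a model at a different rate, a job should get. The sketch below uses hypothetical profiled processing times and a one-shot minimum-cost matching; an actual scheduler such as AlloX solves a richer matching that also accounts for waiting times, and re-solves it over time.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical profiled processing times (seconds) for each job on each device.
# Rows are jobs; columns are device types, e.g., [cpu, gpu, tpu].
proc_time = np.array([
    [90.0,  5.0,  4.0],   # job 0: much faster on accelerators
    [20.0, 15.0, 25.0],   # job 1: barely benefits from a GPU
    [60.0,  8.0,  3.0],   # job 2: fastest on a TPU
])

# One-shot assignment that minimizes total processing time across
# interchangeable devices.
rows, cols = linear_sum_assignment(proc_time)
for job, dev in zip(rows, cols):
    print(f"job {job} -> device {dev} ({proc_time[job, dev]:.0f}s)")
```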
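Finally, a black-box device can still be characterized entirely from the outside. This sketch times an opaque operation using nothing but wall-clock measurements; `op` is a hypothetical callable assumed to block until the device finishes (for asynchronous APIs such as CUDA, it would need an explicit synchronization inside).

```python
import time
import statistics

def blackbox_profile(op, warmup=5, samples=20):
    """Profile an opaque accelerator operation purely from the outside.

    We assume nothing about the device's internals; `op` is a hypothetical
    callable that blocks until the device-side work completes.
    """
    for _ in range(warmup):            # let clocks, caches, and JIT settle
        op()
    times = []
    for _ in range(samples):
        start = time.perf_counter()
        op()
        times.append(time.perf_counter() - start)
    # Short-term predictability: the spread should be small, so the median
    # is a reliable planning input for software-level resource managers.
    return statistics.median(times), statistics.pstdev(times)
```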
My slides from this talk are publicly available and elaborate on these points in more detail.