While training and inference of deep learning models have received significant attention in recent years (e.g., Tiresias, AlloX, and Salus from our group), hyperparameter tuning is often overlooked or lumped into the same bucket of optimizations as training. Existing hyperparameter tuning solutions, primarily from the ML research community, are mostly resource-agnostic. More importantly, even when they try to use up all available resources, they do not distinguish between the throughput of a GPU (how much work the GPU is doing) and its goodput (how much of that work is ultimately useful) during hyperparameter tuning. Fluid is our attempt at bridging the gap between hyperparameter tuning algorithms and the underlying cluster resources by improving both intra- and inter-GPU goodput in large clusters.
Current hyperparameter tuning solutions lack complementary execution engines that efficiently leverage distributed computation; by ignoring the possibility of intra- and inter-GPU sharing, they leave resources poorly utilized. In this paper, we present FluidExec, a generalized hyperparameter tuning execution engine that coordinates between hyperparameter tuning jobs and cluster resources. FluidExec schedules evaluation trials in such jobs using a water-filling approach to make the best use of resources at both intra- and inter-GPU granularities and speed up the tuning process. By abstracting a hyperparameter tuning job as a sequence of TrialGroups, FluidExec can boost the performance of diverse hyperparameter tuning solutions. Our experiments show that FluidExec can speed up synchronous BOHB by 200%, and BOHB and ASHA by 30%, while achieving similar final accuracy.
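To give a flavor of what water-filling over trials looks like, here is a minimal Python sketch. It only illustrates the idea of level-filling GPU shares across the trials of a group; the `Trial` class, the `water_fill` function, and the per-trial `max_parallelism` cap are illustrative assumptions, not FluidExec's actual API or algorithm.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trial:
    """One hyperparameter configuration to evaluate (illustrative only)."""
    trial_id: int
    max_parallelism: float = 4.0   # most GPUs this trial can usefully scale to (assumed)
    gpu_share: float = 0.0         # allocated share; < 1.0 means a slice of one GPU

def water_fill(trials: List[Trial], num_gpus: float) -> None:
    """Level-fill `num_gpus` worth of capacity across pending trials.

    When there are few trials, each one gets multiple GPUs (inter-GPU scaling);
    when there are many, each gets a fractional GPU slice (intra-GPU sharing),
    so GPUs are not left idle while a trial group runs.
    """
    pending = list(trials)
    remaining = float(num_gpus)
    while pending and remaining > 1e-9:
        level = remaining / len(pending)               # tentative equal share
        capped = [t for t in pending if t.max_parallelism <= level]
        if not capped:                                 # nobody hits their cap:
            for t in pending:                          # everyone gets the level
                t.gpu_share = level
            return
        for t in capped:                               # saturate capped trials
            t.gpu_share = t.max_parallelism
            remaining -= t.max_parallelism
        pending = [t for t in pending if t not in capped]

# Example: eight GPUs, three trials with different scalability limits.
trials = [Trial(0, max_parallelism=2), Trial(1), Trial(2)]
water_fill(trials, num_gpus=8)
print([(t.trial_id, t.gpu_share) for t in trials])
# -> [(0, 2.0), (1, 3.0), (2, 3.0)]
```

With sixteen trials on eight GPUs, the same routine would hand each trial a 0.5-GPU slice, which is the intra-GPU sharing case; the real system decides these shares per TrialGroup and packs or scales trials accordingly.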
Fluid is a joint project between Peifeng and Jiachen, which started right after Salus and before Jiachen started her Ph.D.! I’m super excited about the many future projects in the Systems + AI area from SymbioticLab members.