Datacenter load balancing, especially in Clos topologies, remains a hot topic even after almost a decade. The pace of progress has picked up over the last few years with multiple solutions exploring different extremes of the solution space, ranging from edge-based to in-network solutions and using different granularities of load balancing: packets, flowcells, flowlets, or flows. Throughout all these efforts, load balancing granularity has always been either fixed or passively determined. This inflexibility and lack of control lead to many inefficiencies. In Hermes, we take a more active approach: we comprehensively sense path conditions and actively direct traffic to ensure cautious yet timely rerouting.
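To make "passively determined" granularity concrete, here is a minimal sketch of the usual flowlet idea: a new flowlet begins only when the gap between consecutive packets of a flow exceeds some idle threshold. The function name, the microsecond timestamps, and the 500-microsecond threshold are illustrative assumptions, not taken from any particular system.

```python
from typing import List


def split_into_flowlets(timestamps_us: List[int], gap_us: int) -> List[List[int]]:
    """Group packet timestamps (microseconds) of one flow into flowlets.

    A new flowlet starts whenever the idle gap since the previous packet
    exceeds gap_us. The threshold is a hypothetical tuning knob.
    """
    flowlets: List[List[int]] = []
    for ts in timestamps_us:
        if flowlets and ts - flowlets[-1][-1] <= gap_us:
            flowlets[-1].append(ts)   # still inside the current burst
        else:
            flowlets.append([ts])     # idle gap exceeded (or first packet)
    return flowlets


# A steadily sending flow: one packet every 100 microseconds.
steady = [i * 100 for i in range(10)]
# A bursty flow with two idle gaps longer than the threshold.
bursty = [0, 100, 200, 1000, 1100, 3000]

steady_flowlets = split_into_flowlets(steady, gap_us=500)
bursty_flowlets = split_into_flowlets(bursty, gap_us=500)
print(len(steady_flowlets), len(bursty_flowlets))
```

In this toy model the steady flow never produces a gap, so a scheme that reroutes only at flowlet boundaries gets no rerouting opportunity for it, however congested its path becomes. The traffic pattern, not the load balancer, decides when rerouting may happen.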
Production datacenters operate under various uncertainties such as traffic dynamics, topology asymmetry, and failures. Therefore, datacenter load balancing schemes must be resilient to these uncertainties; i.e., they should accurately sense path conditions and react in a timely manner to mitigate the fallout. Despite significant efforts, prior solutions have important drawbacks. On the one hand, solutions such as Presto and DRB are oblivious to path conditions and blindly reroute at a fixed granularity; on the other hand, solutions such as CONGA and CLOVE can sense congestion, but they can only reroute when flowlets emerge, and thus cannot always react to uncertainties in time. To make things worse, these solutions fail to detect or handle failures such as blackholes and random packet drops, which greatly degrades their performance.
In this paper, we propose Hermes, a datacenter load balancer that is resilient to the aforementioned uncertainties. At its heart, Hermes leverages comprehensive sensing to detect path conditions, including failures that prior schemes left unhandled, and reacts with timely yet cautious rerouting. Hermes is a practical edge-based solution that requires no switch modification. We have implemented Hermes with commodity switches and evaluated it through both testbed experiments and large-scale simulations. Our results show that Hermes achieves performance comparable to CONGA and Presto in normal cases, and handles uncertainties well: under asymmetries, Hermes achieves up to 10% and 40% better flow completion time (FCT) than CONGA and CLOVE; under switch failures, it significantly outperforms all other schemes by over 50%.
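As a rough illustration of what "timely yet cautious rerouting" could look like at the edge, the sketch below classifies each path from sensed statistics and moves a flow only when its current path looks bad and a clearly better one exists. This is not the algorithm from the paper: the signals used (smoothed RTT and loss rate), every threshold, and all names such as `PathStats` and `choose_path` are assumptions made up for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class PathState(Enum):
    GOOD = 0
    CONGESTED = 1
    FAILED = 2


@dataclass
class PathStats:
    rtt_us: float      # smoothed round-trip time seen on this path
    loss_rate: float   # fraction of packets lost or retransmitted


# Hypothetical thresholds, chosen only for illustration.
RTT_CONGESTED_US = 300.0
LOSS_FAILED = 0.05
MIN_RTT_GAIN_US = 100.0


def classify(stats: PathStats) -> PathState:
    """Map sensed statistics to a coarse path state."""
    if stats.loss_rate >= LOSS_FAILED:
        return PathState.FAILED       # e.g. blackhole or random drops
    if stats.rtt_us >= RTT_CONGESTED_US:
        return PathState.CONGESTED
    return PathState.GOOD


def choose_path(current: str, paths: Dict[str, PathStats]) -> str:
    """Return the path the flow should use next."""
    cur_state = classify(paths[current])
    if cur_state is PathState.GOOD:
        return current                # cautious: leave healthy flows alone
    good = {n: s for n, s in paths.items()
            if n != current and classify(s) is PathState.GOOD}
    if not good:
        return current                # nowhere better to go
    best = min(good, key=lambda n: good[n].rtt_us)
    if cur_state is PathState.FAILED:
        return best                   # timely: leave a failed path at once
    # Congested: move only if the expected gain is worth the disruption.
    gain = paths[current].rtt_us - good[best].rtt_us
    return best if gain >= MIN_RTT_GAIN_US else current


paths = {"A": PathStats(350.0, 0.0), "B": PathStats(120.0, 0.0)}
print(choose_path("A", paths))
```

Here the flow on path A is moved to B because A is classified as congested and B offers a large RTT improvement. With a marginal improvement the function keeps the flow where it is, which is the "cautious" half of the trade-off.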
While I have spent a lot of time working on datacenter network-related problems, my focus has always been on multi-tenancy issues and on enabling application-awareness in the network using coflows; I have actively stayed away from lower-level details. So when Hong and Kai brought up this load balancing problem after last SIGCOMM, I was a bit apprehensive. I became interested when they posed the problem as an interaction challenge between lower-level load balancing solutions and transport protocols, and I’m glad I got involved. As always, it’s been a pleasure working with Hong and Kai. I’ve been working closely with Hong for about two years now, which has led to two first-authored SIGCOMM papers for him; at this point, I feel like his unofficial co-advisor. Junxue and Wei were instrumental in getting the experiments done in time.
This year the SIGCOMM PC accepted 36 papers out of 250 submissions, a 14.4% acceptance rate. The number of accepted papers went down while the number of submissions went up, which led to the lowest acceptance rate since SIGCOMM 2012. This was also my first time on the SIGCOMM PC.
Thanks so much Prof. Chowdhury! It has been a great pleasure working with you, and I feel so lucky to have your guidance:)