As cloud providers deploy RDMA in their datacenters and developers rewrite/update their applications to use RDMA primitives, a key question remains open: what will happen when multiple RDMA-enabled applications must share the network? Surprisingly, this simple question does not yet have a conclusive answer. This is because existing work focus primarily on improving individual application’s performance using RDMA on a case-by-case basis. So we took a simple step. Yiwen built a benchmarking tool and ran some controlled experiments to find some very interesting results: there is often no performance isolation, and there are many factors that determine RDMA performance beyond the network itself. We are currently investigating the root causes and working on mitigating them.
To meet the increasing throughput and latency demands of modern applications, many operators are rapidly deploying RDMA in their datacenters. At the same time, developers are re-designing their software to take advantage of RDMA’s benefits for individual applications. However, when it comes to RDMA’s performance, many simple questions remain open.
In this paper, we consider the performance isolation characteristics of RDMA. Specifically, we conduct three sets of experiments — three combinations of one throughput-sensitive flow and one latency-sensitive flow — in a controlled environment, observe large discrepancies in RDMA performance with and without the presence of a competing flow, and describe our progress in identifying plausible root-causes.
This work is an offshoot, among several others, of our Infiniswap project on rack-scale memory disaggregation. As time goes by, I’m appreciating more and more Ion’s words of wisdom on building real systems to find real problems.
FTR, the KBNets PC accepted 9 papers out of 22 submissions this year. For a first-time workshop, I consider it a decent rate; there are some great papers in the program from both the industry and academia.