End-to-End Network Design for Unified Memory Disaggregation

Applications in modern cloud datacenters are deployed in resource containers to isolate them from each other. Memory stranding is a pervasive problem in such containerized datacenters: many memory-intensive applications grind to a halt even when free memory exists on other machines. This leads to low utilization, memory fragmentation, and higher overall cost. In theory, memory disaggregation over ultra-fast networks can pool such stranded memory together, but making it practical raises novel systems design, algorithmic, and integration challenges. These include bridging the still-sizable latency gap between local and remote memory access; transparently addressing network-wide fault-tolerance, load imbalance, and performance isolation issues; scaling to large deployments; and supporting heterogeneous software and hardware technologies.

The overarching research objective of this project is to realize a Unified Disaggregated Memory (UDM) abstraction over ultra-fast networks to expose stranded memory across the datacenter as a pool of available memory to out-of-memory containers in a fast, resilient, and scalable manner without any changes to the applications. By designing a comprehensive solution to address host-level, network-level, and end-to-end aspects of the aforementioned challenges, this research aims to make memory disaggregation practical. Specifically, by leveraging the unique characteristics of memory-intensive workloads, ultra-low-latency networks, and multi-tenancy in modern datacenters, this proposal will (i) design a low-latency host networking stack; (ii) enable performance isolation throughout the network; (iii) provide resilience to network-wide uncertainties such as failures and load imbalance; and (iv) incorporate support for heterogeneous memory, networking technologies, and resource management software.

People

Mosharaf Chowdhury (PI)
Dr. Juncheng Gu → ByteDance
Dr. Jie You → Meta
Dr. Hasan Al Maruf → AMD
Yiwen Zhang

Publications

Memory Disaggregation: Advances and Open Challenges, H. A. Maruf, M. Chowdhury, arXiv:2305.03943
TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. Kanaujia, P. Chauhan, ACM ASPLOS, 2023
Aequitas: Admission Control for Performance-Critical RPCs in Datacenters, Y. Zhang, G. Kumar, N. Dukkipati, X. Wu, P. Jha, M. Chowdhury, A. Vahdat, ACM SIGCOMM, 2022
TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. Kanaujia, P. Chauhan, arXiv:2206.02878
Justitia: Software Multi-Tenancy in Hardware Kernel-Bypass Networks, Y. Zhang, Y. Tan, B. Stephens, M. Chowdhury, USENIX NSDI, 2022
Hydra: Resilient and Highly Available Remote Memory, Y. Lee*, H. A. Maruf*, A. Cidon, M. Chowdhury, K. G. Shin, USENIX FAST, 2022 (*Equal contribution)
Memtrade: A Disaggregated-Memory Marketplace for Public Clouds, H. A. Maruf, Y. Zhong, H. Wang, M. Chowdhury, A. Cidon, C. Waldspurger, arXiv:2108.06893
Programmable Packet Scheduling with a Single Queue, Z. Yu, C. Hu, J. Wu, X. Sun, V. Braverman, M. Chowdhury, Z. Liu, X. Jin, ACM SIGCOMM, 2021
Ship Compute or Ship Data? Why Not Both?, J. You, J. Wu, X. Jin, M. Chowdhury, USENIX NSDI, 2021
Mitigating the Performance-Efficiency Tradeoff in Resilient Memory Disaggregation, Y. Lee, H. A. Maruf, M. Chowdhury, A. Cidon, K. G. Shin, arXiv:1910.09727
NetLock: Fast, Centralized Lock Management Using Programmable Switches, Z. Yu, Y. Zhang, V. Braverman, M. Chowdhury, X. Jin, ACM SIGCOMM, 2020
Effectively Prefetching Remote Memory with Leap, H. A. Maruf, M. Chowdhury, USENIX ATC, 2020 (Best Paper Award)

Software

All software artifacts developed as part of this project are released as open-source with permissive licenses and can be found at https://github.com/SymbioticLab.

Workshop

NSF Workshop on Next-Gen Cloud Research Infrastructure

Outreach

K-12 students and educators can get involved in this project through the following resources.

Media

Support

This project is supported by a CAREER award from the National Science Foundation (CNS-1845853).