Applications in modern cloud datacenters are deployed in resource containers to isolate them from each other. Memory stranding is a pervasive problem in such containerized datacenters: many memory-intensive applications grind to a halt even when free memory exists on other machines. This leads to low utilization, memory fragmentation, and higher overall cost. In theory, memory disaggregation over ultra-fast networks can pool such stranded memory, but making it practical poses novel systems design, algorithmic, and integration challenges. These include bridging the still-sizable latency gap between local and remote memory access; transparently addressing network-wide fault tolerance, load imbalance, and performance isolation; achieving scalability; and supporting heterogeneous software and hardware technologies.
The overarching research objective of this project is to realize a Unified Disaggregated Memory (UDM) abstraction over ultra-fast networks that exposes stranded memory across the datacenter as a pool of available memory to out-of-memory containers in a fast, resilient, and scalable manner, without any changes to the applications. By designing a comprehensive solution to the host-level, network-level, and end-to-end aspects of the aforementioned challenges, this research aims to make memory disaggregation practical. Specifically, by leveraging the unique characteristics of memory-intensive workloads, ultra-low-latency networks, and multi-tenancy in modern datacenters, this project will (i) design a low-latency host networking stack; (ii) enable performance isolation throughout the network; (iii) provide resilience to network-wide uncertainties such as failures and load imbalance; and (iv) incorporate support for heterogeneous memory, networking technologies, and resource management software.
People
- Mosharaf Chowdhury (PI)
- Dr. Juncheng Gu → ByteDance
- Dr. Jie You → Meta
- Dr. Hasan Al Maruf → AMD
- Yiwen Zhang
Publications
- Memory Disaggregation: Advances and Open Challenges, H. A. Maruf, M. Chowdhury, arXiv:2305.03943
- TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. Kanaujia, P. Chauhan, ACM ASPLOS, 2023
- Aequitas: Admission Control for Performance-Critical RPCs in Datacenters, Y. Zhang, G. Kumar, N. Dukkipati, X. Wu, P. Jha, M. Chowdhury, A. Vahdat, ACM SIGCOMM, 2022
- TPP: Transparent Page Placement for CXL-Enabled Tiered Memory, H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. Kanaujia, P. Chauhan, arXiv:2206.02878
- Justitia: Software Multi-Tenancy in Hardware Kernel-Bypass Networks, Y. Zhang, Y. Tan, B. Stephens, M. Chowdhury, USENIX NSDI, 2022
- Hydra: Resilient and Highly Available Remote Memory, Y. Lee*, H. A. Maruf*, A. Cidon, M. Chowdhury, K. G. Shin, USENIX FAST, 2022 (*Equal contribution)
- Memtrade: A Disaggregated-Memory Marketplace for Public Clouds, H. A. Maruf, Y. Zhong, H. Wang, M. Chowdhury, A. Cidon, C. Waldspurger, arXiv:2108.06893
- Programmable Packet Scheduling with a Single Queue, Z. Yu, C. Hu, J. Wu, X. Sun, V. Braverman, M. Chowdhury, Z. Liu, X. Jin, ACM SIGCOMM, 2021
- Ship Compute or Ship Data? Why Not Both?, J. You, J. Wu, X. Jin, M. Chowdhury, USENIX NSDI, 2021
- Mitigating the Performance-Efficiency Tradeoff in Resilient Memory Disaggregation, Y. Lee, H. A. Maruf, M. Chowdhury, A. Cidon, K. G. Shin, arXiv:1910.09727
- NetLock: Fast, Centralized Lock Management Using Programmable Switches, Z. Yu, Y. Zhang, V. Braverman, M. Chowdhury, X. Jin, ACM SIGCOMM, 2020
- Effectively Prefetching Remote Memory with Leap, H. A. Maruf, M. Chowdhury, USENIX ATC, 2020 (Best Paper Award)
Software
All software artifacts developed as part of this project are released as open-source with permissive licenses and can be found at https://github.com/SymbioticLab.
- Aequitas source code
- Justitia source code
- Memtrade source code
- AIFO source code
- Kayak source code
- Hydra source code
- NetLock source code
- Leap source code
Workshop
Outreach
K-12 students and educators can get involved in this project through the following resources.
- Center for Engineering Diversity & Outreach (CEDO)
- The Data Science Summer Camp for High School Students
- Research Education and Activities for Classroom Teachers (REACT)
Media
- Meta Platform Hacks CXL Memory Tier into Linux (nextplatform.com)
- “Hiding” network latency for fast memory in data centers (umich.edu)
- Chowdhury wins NSF CAREER award for making memory cheaper, more efficient in big data centers (umich.edu)
- Decentralized Memory Disaggregation Over Low-Latency Networks | USENIX
- Breakthrough for large scale computing: ‘Memory disaggregation’ made practical | University of Michigan News (umich.edu)
- Clever RDMA Technique Delivers Distributed Memory Pooling (nextplatform.com)
- Clever RDMA Technique Delivers Distributed Memory Pooling | Hacker News (ycombinator.com)
- University of Michigan Demonstrates Novel Memory Disaggregation Technology | TOP500
Support
This project is supported by a CAREER award from the National Science Foundation (CNS-1845853).