Research

2025

On Evaluating Performance of LLM Inference Serving Systems

Amey Agrawal, Nitin Kedia, Anmol Agarwal, Jayashree Mohan, Nipun Kwatra, Souvik Kundu, Ramachandran Ramjee, Alexey Tumanov

arXiv · 15 Jul 2025 · arxiv:2507.09019

Toward Weight Sharing Paradigm for Efficient AI: Training and Inference Serving

Payman Behnam, Alind Khare, Dhruv Garg, Alexey Tumanov

ACM SIGOPS Operating Systems Review, Volume 59, Issue 2 · 01 Jul 2025

Efficient LLM Inference via Chunked Prefills

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

ACM SIGOPS Operating Systems Review, Volume 59, Issue 2 · 01 Jul 2025

EMPIRIC: Exploring Missing Pieces in KV Cache Compression for Reducing Computation, Storage, and Latency in Long-Context LLM Inference

Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov

ACM SIGOPS Operating Systems Review, Volume 59, Issue 2 · 01 Jul 2025

Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations

Amey Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, Ramachandran Ramjee, Alexey Tumanov, Esha Choukse

arXiv · 12 May 2025 · arxiv:2409.17264

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads

Alind Khare, Dhruv Garg, Sukrit Kalra, Snigdha Grandhi, Ion Stoica, Alexey Tumanov

22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 2025) · 28 Apr 2025

Video

Paper

Client Availability in Federated Learning: It Matters!

Dhruv Garg, Debopam Sanyal, Myungjin Lee, Alexey Tumanov, Ada Gavrilovska

5th Workshop on Machine Learning and Systems (EuroMLSys), co-located with EuroSys '25 · 30 Mar 2025 · doi:10.1145/3721146.3721964

Paper

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression

Payman Behnam, Yaosheng Fu, Ritchie Zhao, Po-An Tsai, Zhiding Yu, Alexey Tumanov

arXiv · 21 Feb 2025 · arxiv:2502.14051

2024

Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization

Amey Agrawal, Sameer Reddy, Satwik Bhattamishra, Venkata Prabhakara Sarath Nookala, Vidushi Vashishth, Kexin Rong, Alexey Tumanov

15th ACM Symposium on Cloud Computing (SoCC 2024), Redmond, USA, Nov 2024 · 15 Oct 2024 · arxiv:2306.11800

Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems

Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun Kwatra, Ramachandran Ramjee, Alexey Tumanov

arXiv · 02 Sep 2024 · arxiv:2407.07000

Code

SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference

Alind Khare, Animesh Agrawal, Aditya Annavajjala, Payman Behnam, Myungjin Lee, Hugo Latapie, Alexey Tumanov

18th European Conference on Computer Vision (ECCV 2024), Milano, Italy, Oct 2024 · 12 Jul 2024 · arxiv:2301.10879

DεpS: Delayed ε-Shrinking for Faster Once-For-All Training

Aditya Annavajjala, Alind Khare, Animesh Agrawal, Igor Fedorov, Hugo Latapie, Myungjin Lee, Alexey Tumanov

18th European Conference on Computer Vision (ECCV 2024), Milano, Italy, Oct 2024 · 09 Jul 2024 · arxiv:2407.06167

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

18th USENIX Symposium on Operating Systems Design and Implementation (OSDI’24), Santa Clara · 19 Jun 2024 · arxiv:2403.02310

Video

Code

Harmonica: Hybrid Accelerator to Overcome Imperfections of Mixed-signal DNN Accelerators

Payman Behnam, Uday Kamal, A. Shafiee, Alexey Tumanov, Saibal Mukhopadhyay

Proc. 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS'24) · 27 May 2024

Vidur: A Large-Scale Simulation Framework For LLM Inference

Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov

7th Annual Conference on Machine Learning Systems (MLSys’24), Santa Clara · 22 May 2024 · arxiv:2405.05465

Video

Code

2023

TransEHR: Self-Supervised Transformer for Clinical Time Series Data

Yanbo Xu, Shangqing Xu, Manav Ramprassad, Alexey Tumanov, Chao Zhang

Proc. of Machine Learning for Health (ML4H'23) · 10 Dec 2023

Hardware–Software Co-Design for Real-Time Latency–Accuracy Navigation in Tiny Machine Learning Applications

Payman Behnam, Jianming Tong, Alind Khare, Yangyu Chen, Yue Pan, Pranav Gadikar, Abhimanyu Bambhaniya, Tushar Krishna, Alexey Tumanov

IEEE Micro · 01 Nov 2023 · doi:10.1109/MM.2023.3317243

ABKD: Graph Neural Network Compression with Attention-Based Knowledge Distillation

Anshul Ahluwalia, Rohit Das, Payman Behnam, Alind Khare, Pan Li, Alexey Tumanov

arXiv · 25 Oct 2023 · arxiv:2310.15938

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

arXiv · 01 Sep 2023 · arxiv:2308.16369

Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems

Debopam Sanyal, Jui-Tse Hung, Manav Agrawal, Prahlad Jasti, Shahab Nikkhoo, Somesh Jha, Tianhao Wang, Sibin Mohan, Alexey Tumanov

arXiv · 08 Aug 2023 · arxiv:2307.01292

Subgraph Stationary Hardware-Software Inference Co-Design

Payman Behnam, Jianming Tong, Alind Khare, Yangyu Chen, Yue Pan, Pranav Gadikar, Abhimanyu Rajeshkumar Bambhaniya, Tushar Krishna, Alexey Tumanov

Proc. of Sixth Conference on Machine Learning and Systems (MLSys'23) · 03 Jul 2023 · arxiv:2306.17266

Slides

Signed-Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off

Sachit Kuhar, Alexey Tumanov, Judy Hoffman

Proc. of 3rd On-Device Intelligence Workshop, Machine Learning and Systems (MLSys'23) · 01 Jun 2023

2022

UnfoldML: Cost-Aware and Uncertainty-Based Dynamic 2D Prediction for Multi-Stage Classification

Yanbo Xu, Alind Khare, Glenn Matlin, Monish Ramadoss, Rishikesan Kamaleswaran, Chao Zhang, Alexey Tumanov

Proc. of 36th Conference on Neural Information Processing Systems (NeurIPS'22) · 31 Oct 2022 · arxiv:2210.15056

Enabling Real-time DNN Switching via Weight-Sharing

Jianming Tong, Yangyu Chen, Yue Pan, Abhimanyu Bambhaniya, Alind Khare, Taekyung Heo, Alexey Tumanov, Tushar Krishna

Proc. of 2nd Architecture, Compiler, and System Support for Multi-model DNN Workloads Workshop · 01 Jun 2022

2021

CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment

Manas Sahni, Shreya Varshini, Alind Khare, Alexey Tumanov

Proc. of International Conference on Learning Representations (ICLR'21) · 27 Apr 2021 · arxiv:2104.12642

Video

Code

2020

HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units

Shenda Hong, Yanbo Xu, Alind Khare, Satria Priambada, Kevin Maher, Alaa Aljiffry, Jimeng Sun, Alexey Tumanov

Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining · 20 Aug 2020 · doi:10.1145/3394486.3403212

Video

Code