Research
2025

RocketKV: Accelerating Long-Context LLM Inference via Two-Stage KV Cache Compression
arXiv
·
21 Feb 2025
·
arxiv:2502.14051
2024

Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization
15th ACM Symposium on Cloud Computing (SoCC 2024), Redmond, USA, Nov 2024
·
15 Oct 2024
·
arxiv:2306.11800

Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
arXiv
·
27 Sep 2024
·
arxiv:2409.17264

Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
arXiv
·
02 Sep 2024
·
arxiv:2407.07000

SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
18th European Conference on Computer Vision (ECCV 2024), Milano, Italy, Oct 2024
·
12 Jul 2024
·
arxiv:2301.10879

DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
18th European Conference on Computer Vision (ECCV 2024), Milano, Italy, Oct 2024
·
09 Jul 2024
·
arxiv:2407.06167

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
18th USENIX Symposium on Operating Systems Design and Implementation (OSDI’24), Santa Clara
·
19 Jun 2024
·
arxiv:2403.02310

Harmonica: Hybrid Accelerator to Overcome Imperfections of Mixed-signal DNN Accelerators
Proc. of 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS'24)
·
27 May 2024

Vidur: A Large-Scale Simulation Framework For LLM Inference
7th Annual Conference on Machine Learning and Systems (MLSys'24), Santa Clara
·
22 May 2024
·
arxiv:2405.05465
2023

SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI'25), Philadelphia, USA, 2025
·
29 Dec 2023
·
arxiv:2312.16733

TransEHR: Self-Supervised Transformer for Clinical Time Series Data
Proc. of Machine Learning for Health (ML4H'23)
·
10 Dec 2023

Hardware–Software Co-Design for Real-Time Latency–Accuracy Navigation in Tiny Machine Learning Applications
IEEE Micro
·
01 Nov 2023
·
doi:10.1109/MM.2023.3317243

ABKD: Graph Neural Network Compression with Attention-Based Knowledge Distillation
arXiv
·
25 Oct 2023
·
arxiv:2310.15938

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
arXiv
·
01 Sep 2023
·
arxiv:2308.16369

Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems
arXiv
·
08 Aug 2023
·
arxiv:2307.01292

Subgraph Stationary Hardware-Software Inference Co-Design
Proc. of Sixth Conference on Machine Learning and Systems (MLSys'23)
·
03 Jul 2023
·
arxiv:2306.17266

Signed-Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off
Proc. of 3rd On-Device Intelligence Workshop at MLSys'23
·
01 Jun 2023
2022

UnfoldML: Cost-Aware and Uncertainty-Based Dynamic 2D Prediction for Multi-Stage Classification
Proc. of 36th Conference on Neural Information Processing Systems (NeurIPS'22)
·
31 Oct 2022
·
arxiv:2210.15056

Enabling Real-time DNN Switching via Weight-Sharing
Proc. of 2nd Architecture, Compiler, and System Support for Multi-model DNN Workloads Workshop
·
01 Jun 2022
2021

CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment
Proc. of International Conference on Learning Representations (ICLR'21)
·
27 Apr 2021
·
arxiv:2104.12642
2020

HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units
Proc. of 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD'20)
·
20 Aug 2020
·
doi:10.1145/3394486.3403212