Research
2024
Mnemosyne: Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
arXiv
·
27 Sep 2024
·
arxiv:2409.17264
SuperFedNAS: Cost-Efficient Federated Neural Architecture Search for On-Device Inference
18th European Conference on Computer Vision (ECCV 2024), Milano, Italy, Oct 2024
·
12 Jul 2024
·
arxiv:2301.10879
Metron: Holistic Performance Evaluation Framework for LLM Inference Systems
arXiv
·
10 Jul 2024
·
arxiv:2407.07000
DεpS: Delayed ε-Shrinking for Faster Once-For-All Training
18th European Conference on Computer Vision (ECCV 2024), Milano, Italy, Oct 2024
·
09 Jul 2024
·
arxiv:2407.06167
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
18th USENIX Symposium on Operating Systems Design and Implementation (OSDI’24), Santa Clara
·
19 Jun 2024
·
arxiv:2403.02310
Harmonica: Hybrid Accelerator to Overcome Imperfections of Mixed-signal DNN Accelerators
Proc. of 38th IEEE International Parallel and Distributed Processing Symposium (IPDPS'24)
·
27 May 2024
Vidur: A Large-Scale Simulation Framework For LLM Inference
7th Annual Conference on Machine Learning Systems (MLSys’24), Santa Clara
·
22 May 2024
·
arxiv:2405.05465
2023
SuperServe: Fine-Grained Inference Serving for Unpredictable Workloads
22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI'25), Philadelphia, USA
·
29 Dec 2023
·
arxiv:2312.16733
TransEHR: Self-Supervised Transformer for Clinical Time Series Data
Proc. of Machine Learning for Health (ML4H'23)
·
10 Dec 2023
Hardware–Software Co-Design for Real-Time Latency–Accuracy Navigation in Tiny Machine Learning Applications
IEEE Micro
·
01 Nov 2023
·
doi:10.1109/MM.2023.3317243
ABKD: Graph Neural Network Compression with Attention-Based Knowledge Distillation
arXiv
·
25 Oct 2023
·
arxiv:2310.15938
DynaQuant: Compressing Deep Learning Training Checkpoints via Dynamic Quantization
15th ACM Symposium on Cloud Computing (SoCC 2024), Redmond, USA, Nov 2024
·
06 Sep 2023
·
arxiv:2306.11800
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills
arXiv
·
01 Sep 2023
·
arxiv:2308.16369
Pareto-Secure Machine Learning (PSML): Fingerprinting and Securing Inference Serving Systems
arXiv
·
08 Aug 2023
·
arxiv:2307.01292
Subgraph Stationary Hardware-Software Inference Co-Design
Proc. of Sixth Conference on Machine Learning and Systems (MLSys'23)
·
03 Jul 2023
·
arxiv:2306.17266
Signed-Binarization: Unlocking Efficiency Through Repetition-Sparsity Trade-Off
Proc. of 3rd On-Device Intelligence Workshop, Machine Learning and Systems (MLSys'23)
·
01 Jun 2023
2022
UnfoldML: Cost-Aware and Uncertainty-Based Dynamic 2D Prediction for Multi-Stage Classification
Proc. of 36th Conference on Neural Information Processing Systems (NeurIPS'22)
·
31 Oct 2022
·
arxiv:2210.15056
Enabling Real-time DNN Switching via Weight-Sharing
Proc. of 2nd Architecture, Compiler, and System Support for Multi-model DNN Workloads Workshop
·
01 Jun 2022
2021
CompOFA: Compound Once-For-All Networks for Faster Multi-Platform Deployment
Proc. of International Conference on Learning Representations (ICLR'21)
·
27 Apr 2021
·
arxiv:2104.12642
2020
HOLMES: Health OnLine Model Ensemble Serving for Deep Learning Models in Intensive Care Units
Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
·
20 Aug 2020
·
doi:10.1145/3394486.3403212