Our Publications
2026
Contextra: Hierarchical Context Caching for Long Context Language Model Serving
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2026
Agent JIT Compilation for Latency-Optimizing Web Agent Planning and Scheduling
International Conference on Machine Learning (ICML), 2026
RaidServe: High-performance Resilient Serving
Conference on Machine Learning and Systems (MLSys), 2026
SYMI: Efficient Mixture-of-Experts Training via Model and Optimizer State Decoupling
USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2026
Sparse Checkpointing for Fast and Reliable MoE Training
USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2026
Hunting CUDA Bugs at Scale with cuFuzz
Proceedings of the ACM on Programming Languages (OOPSLA), 2026
2025
Wave: Offloading Resource Management to SmartNIC Cores
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Conference on Machine Learning and Systems (MLSys), 2025
· OpenReview · PDF
Teaching Cloud Infrastructure and Scalable Application Deployment in an Undergraduate Computer Science Program
ACM Technical Symposium on Computer Science Education (SIGCSETS), 2025
2024
WarpDrive: An Agentic Workflow for Ninja GPU Transformations
NeurIPS Workshop on Machine Learning for Systems, 2024
SGLang: Efficient Execution of Structured Language Model Programs
Conference on Neural Information Processing Systems (NeurIPS), 2024
· OpenReview · PDF
ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation
ACM SIGOPS Symposium on Operating Systems Principles (SOSP), 2024
cedar: Optimized and Unified Machine Learning Input Data Pipelines
The International Journal on Very Large Data Bases (VLDB), 2024
High-throughput and Flexible Host Networking for Accelerated Computing
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024
2023
Tectonic-Shift: A Composite Storage Fabric for Large-Scale ML Training
USENIX Annual Technical Conference (USENIX ATC), 2023
· USENIX
Honeycomb: Secure and Efficient GPU Executions via Static Validation
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2023
R3: Record-Replay-Retroaction for Database-Backed Applications
The International Journal on Very Large Data Bases (VLDB), 2023
RecD: Deduplication for end-to-end deep learning recommendation model training infrastructure
Conference on Machine Learning and Systems (MLSys), 2023
2022
Optimizing video analytics with declarative model relationships
The International Journal on Very Large Data Bases (VLDB), 2022
Hermod: principled and practical scheduling for serverless functions
ACM Symposium on Cloud Computing (SoCC), 2022
Towards μs tail latency and terabit ethernet: disaggregating the host network stack
ACM Special Interest Group on Data Communication (SIGCOMM), 2022
Understanding data storage and ingestion for large-scale deep recommendation model training: Industrial product
International Symposium on Computer Architecture (ISCA), 2022
SOL: Safe on-node learning in cloud platforms
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
ShEF: Shielded enclaves for cloud fpgas
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
RecShard: statistical feature-based memory optimization for industry-scale neural recommendation
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2022
VIVA: An End-to-End System for Interactive Video Analytics
Conference on Innovative Data Systems Research (CIDR), 2022
· Paper
A Progress Report on DBOS: A Database-oriented Operating System
Conference on Innovative Data Systems Research (CIDR), 2022
· Paper
2021
Llama: A heterogeneous & serverless framework for auto-tuning video analytics pipelines
ACM Symposium on Cloud Computing (SoCC), 2021
Faa$t: A Transparent Auto-Scaling Cache for Serverless Applications
ACM Symposium on Cloud Computing (SoCC), 2021
Syrup: User-defined scheduling across the stack
ACM SIGOPS Symposium on Operating Systems Principles (SOSP), 2021
ghOSt: Fast & Flexible User-Space Delegation of Linux Scheduling
ACM SIGOPS Symposium on Operating Systems Principles (SOSP), 2021
DBOS: a DBMS-oriented operating system
The International Journal on Very Large Data Bases (VLDB), 2021
INFaaS: Automated Model-less Inference Serving
USENIX Annual Technical Conference (USENIX ATC), 2021
· USENIX
A case against (most) context switches
USENIX Workshop on Hot Topics in Operating Systems (HotOS), 2021
Smartharvest: Harvesting idle cpus safely and efficiently in the cloud
European Conference on Computer Systems (EuroSys), 2021
Interference-aware scheduling for inference serving
European Conference on Machine Learning Systems (EuroMLSys), 2021
RAMBO: Resource allocation for microservices using Bayesian optimization
IEEE Computer Architecture Letters, 2021
2020
RackSched: A Microsecond-Scale Scheduler for Rack-Scale Computers
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020
· USENIX
Leveraging application classes to save power in highly-utilized data centers
ACM Symposium on Cloud Computing (SoCC), 2020
A polystore based database operating system (DBOS)
International Conference on Very Large Data Bases (VLDB) Workshop, 2020
Interstellar: Using Halides Scheduling Language to Analyze DNN Accelerators
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
Classifying memory access patterns for prefetching
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2020
2019
Mind the gap: A case for informed request scheduling at the nic
ACM Workshop on Hot Topics in Networks (HotNets), 2019
Centralized core-granular scheduling for serverless functions
ACM Symposium on Cloud Computing (SoCC), 2019
From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers
USENIX Annual Technical Conference (USENIX ATC), 2019
· USENIX
AsmDB: understanding and mitigating front-end stalls in warehouse-scale computers
International Symposium on Computer Architecture (ISCA), 2019
A case for managed and model-less inference serving
USENIX Workshop on Hot Topics in Operating Systems (HotOS), 2019
Tangram: Optimized coarse-grained dataflow for scalable nn accelerators
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019
Shinjuku: Preemptive Scheduling for μsecond-scale Tail Latency
USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019
2018
Pocket: Elastic Ephemeral Storage for Serverless Analytics
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2018
QuMan Profile-based Improvement of Cluster Utilization
ACM Transactions on Architecture and Code Optimization (TACO), 2018
Learning Memory Access Patterns
International Conference on Machine Learning (ICML), 2018
Understanding Ephemeral Storage for Serverless Analytics
USENIX Annual Technical Conference (USENIX ATC), 2018
· USENIX
Selecta: Heterogeneous Cloud Storage Configuration for Data Analytics
USENIX Annual Technical Conference (USENIX ATC), 2018
· USENIX
Spatial: A language and compiler for application accelerators
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2018
Memory Hierarchy for Web Search
International Symposium on High-Performance Computer Architecture (HPCA), 2018
Making pull-based graph processing performant
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2018
GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition
International Symposium on High-Performance Computer Architecture (HPCA), 2018
2017
3D nanosystems enable embedded abundant-data computing: special session paper
International Conference on Hardware/Software Codesign and System Synthesis (CODES), 2017
Persona: A High-Performance Bioinformatics Framework
USENIX Annual Technical Conference (USENIX ATC), 2017
· USENIX
Plasticine: A Reconfigurable Architecture For Parallel Paterns
International Symposium on Computer Architecture (ISCA), 2017
Tetris: Scalable and efficient neural network acceleration with 3d memory
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017
Reflex: Remote flash≈ local flash
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017
Bolt: I know what you did last summer... in the cloud
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2017
2016
The IX operating system: Combining low latency, high throughput, and efficiency in a protected dataplane
ACM Transactions on Computer Systems (TOCS), 2016
DRAF: A low-power DRAM-based reconfigurable acceleration fabric
International Symposium on Computer Architecture (ISCA), 2016
Automatic generation of efficient accelerators for reconfigurable hardware
International Symposium on Computer Architecture (ISCA), 2016
Improving Resource Efficiency at Scale with Heracles
ACM Transactions on Computer Systems (TOCS), 2016
HRL: Efficient and flexible reconfigurable logic for near-data processing
International Symposium on High-Performance Computer Architecture (HPCA), 2016
Generating configurable hardware from parallel patterns
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016
2015
Practical Near-Data Processing for In-Memory Analytics Frameworks
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2015
Energy proportionality and workload consolidation for latency-critical applications
ACM Symposium on Cloud Computing (SoCC), 2015
Heracles: improving resource efficiency at scale
International Symposium on Computer Architecture (ISCA), 2015
2014
IX: A Protected Dataplane Operating System for High Throughput and Low Latency
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014
Towards energy proportionality for large-scale latency-critical workloads
International Symposium on Computer Architecture (ISCA), 2014
Dynamic management of TurboMode in modern multi-core chips
International Symposium on High-Performance Computer Architecture (HPCA), 2014
Reconciling high server utilization and sub-millisecond quality-of-service
European Conference on Computer Systems (EuroSys), 2014
High performance hardware-accelerated flash key-value store
Non-Volatile Memories Workshop (NVMW), 2014
Quasar: Resource-efficient and QoS-aware cluster management
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014
2013
QoS-aware scheduling in heterogeneous data centers with paragon
ACM Transactions on Computer Systems (TOCS), 2013
Measuring and analyzing the energy use of enterprise computing systems
Sustainable Computing Informatics and Systems, 2013
iBench: Quantifying interference for datacenter applications
IEEE International Symposium on Workload Characterization (IISWC), 2013
Locality-aware task management for unstructured parallelism: A quantitative limit study
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2013
ZSim: Fast and accurate microarchitectural simulation of thousand-core systems
International Symposium on Computer Architecture (ISCA), 2013
QoS-Aware Admission Control in Heterogeneous Datacenters
International Conference on Autonomic Computing (ICAC), 2013
· USENIX
Convolution engine: balancing efficiency & flexibility in specialized computing
Communications of the ACM (CACM), 2013
Resource efficient computing for warehouse-scale datacenters
Design, Automation and Test in Europe Conference and Exhibition (DATE), 2013
· ACM DL
Paragon: QoS-aware scheduling for heterogeneous datacenters
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2013
2012
A case of system-level hardware/software co-design and co-verification of a commodity multi-processor system with custom hardware
International Conference on Hardware/Software Codesign and System Synthesis (CODES), 2012
Towards energy-proportional datacenter memory with mobile DRAM
International Symposium on Computer Architecture (ISCA), 2012
Green enterprise computing data: Assumptions and realities
International Green Computing Conference (IGCC), 2012
ECHO: Recreating network traffic maps for datacenters with tens of thousands of servers
IEEE International Symposium on Workload Characterization (IISWC), 2012
Dune: Safe User-level Access to Privileged CPU Features
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012
Decoupling datacenter studies from access to large-scale applications: A modeling approach for storage workloads
IEEE International Symposium on Workload Characterization (IISWC), 2012
Improving system energy efficiency with memory rank subsetting
ACM Transactions on Architecture and Code Optimization (TACO), 2012
2011
MARS: adaptive remote execution for multi-threaded mobile devices
ACM SOSP Workshop on Networking, Systems, and Applications on Mobile Handhelds (MobiHeld), 2011
Dynamic Fine-Grain Scheduling of Pipeline Parallelism
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2011
Time and cost-efficient modeling and generation of large-scale tpcc/tpce/tpch workloads
TPC Technology conference on Topics in Performance Evaluation, Measurement and Characterization (TPCTC), 2011
Vantage: Scalable and efficient fine-grain cache partitioning
International Symposium on Computer Architecture (ISCA), 2011
Cross-examination of datacenter workload modeling techniques
IEEE International Conference on Distributed Computing Systems (ICDCS), 2011
Storage I/O generation and replay for datacenter applications
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2011
Hardware acceleration of transactional memory on commodity systems
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2011
Accurate modeling and generation of storage i/o for datacenter workloads
ACM Transactions on Storage (ToS), 2011
2010
The ZCache: Decoupling Ways and Associativity
International Symposium on Microarchitecture (MICRO), 2010
Eigenbench: A simple exploration tool for orthogonal TM characteristics
IEEE International Symposium on Workload Characterization (IISWC), 2010
Evaluating impact of manageability features on device performance
International Conference on Network and Service Management (CNSM), 2010
Understanding sources of inefficiency in general-purpose chips
International Symposium on Computer Architecture (ISCA), 2010
Making nested parallel transactions practical using lightweight hardware support
International Conference on Supercomputing (ICS), 2010
Implementing and evaluating nested parallel transactions in software transactional memory
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2010
FARM: A Prototyping Environment for Tightly-Coupled, Heterogeneous Architectures
IEEE Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2010
Evaluating Bufferless Flow Control for On-chip Networks
IEEE/ACM International Symposium on Networks-on-Chip (NOCS), 2010
An analysis of on-chip interconnection networks for large-scale chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO), 2010
Implementing and Evaluating a Model Checker for Transactional Memory Systems
IEEE International Conference on Engineering of Complex Computer Systems (ICECCS), 2010
Flexible architectural support for fine-grain scheduling
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2010
The case for RAMClouds: scalable high-performance storage entirely in DRAM
ACM SIGOPS Operating Systems Review, 2010
Identifying energy waste through dense power sensing and utilization monitoring
Technical Report, 2010
· Report
2009
Future scaling of processor-memory interfaces
International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2009
Energy dumpster diving
USENIX Workshop on Hot Topics in Power Aware Computing (HotPower), 2009
· Paper
Power management of datacenter workloads using per-core power gating
IEEE Computer Architecture Letters, 2009
Optimizing memory transactions for multicore systems
Multicore Processors and Systems. Integrated Circuits and Systems, 2009
Nemesis: Preventing Authentication and Access Control Vulnerabilities in Web Applications
USENIX Security Symposium, 2009
Fast memory snapshot for concurrent programmingwithout synchronization
International Conference on Supercomputing (ICS), 2009
Decoupling Dynamic Information Flow Tracking with a dedicated coprocessor
IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), 2009
A memory system design framework: creating smart memories
International Symposium on Computer Architecture (ISCA), 2009
Feedback-directed barrier optimization in a strongly isolated STM
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), 2009
2008
Hardware Enforcement of Application Security Policies Using Tagged Memory
USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2008
Comparative evaluation of memory models for chip multiprocessors
ACM Transactions on Architecture and Code Optimization (TACO), 2008
A comparison of high-level full-system power models
USENIX Workshop on Hot Topics in Power Aware Computing (HotPower), 2008
STAMP: Stanford Transactional Applications for Multi-Processing
IEEE International Symposium on Workload Characterization (IISWC), 2008
Real-World Buffer Overflow Protection for Userspace and Kernelspace
USENIX Security Symposium, 2008
Improving software concurrency with hardware-assisted memory snapshot
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2008
Ased: availability, security, and debugging support usingtransactional memory
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2008
Thread-safe dynamic binary translation using transactional memory
International Symposium on High-Performance Computer Architecture (HPCA), 2008
2007
The OpenTM Transactional Application Programming Interface
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2007
A low power front-end for embedded processors using a block-aware instruction set
International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES), 2007
Towards soft optimization techniques for parallel cognitive applications
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2007
Raksha: a flexible information flow architecture for software security
International Symposium on Computer Architecture (ISCA), 2007
JouleSort: a balanced energy-efficiency benchmark
ACM International Conference on Management of Data (SIGMOD), 2007
Comparing memory systems for chip multiprocessors
International Symposium on Computer Architecture (ISCA), 2007
An effective hybrid transactional memory system with strong isolation guarantees
International Symposium on Computer Architecture (ISCA), 2007
Register pointer architecture for efficient embedded processors
Design, Automation and Test in Europe Conference and Exhibition (DATE), 2007
ATLAS: A Chip-Multiprocessor with Transactional Memory Support
Design, Automation and Test in Europe Conference and Exhibition (DATE), 2007
Transactional programming in a multi-core environment
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2007
Transactional collection classes
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2007
From chaos to QoS: case studies in CMP resource management
ACM SIGARCH Computer Architecture News, 2007
Evaluating MapReduce for Multi-core and Multiprocessor Systems
International Symposium on High-Performance Computer Architecture (HPCA), 2007
A Scalable, Non-blocking Approach to Transactional Memory
International Symposium on High-Performance Computer Architecture (HPCA), 2007
A practical FPGA-based framework for novel CMP research
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), 2007
2006
Tradeoffs in transactional memory virtualization
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2006
Testing implementations of transactional memory
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2006
Block-aware instruction set architecture
ACM Transactions on Architecture and Code Optimization (TACO), 2006
The Atomos transactional programming language
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2006
Parallelizing specjbb2000 with transactional memory
Workshop on Transactional Memory Workloads (WTW) at PLDI, 2006
Deconstructing hardware architectures for security
Annual Workshop on Duplicating, Deconstructing, and Debunking, 2006
Architectural semantics for practical transactional memory
International Symposium on Computer Architecture (ISCA), 2006
Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors
Design, Automation and Test in Europe Conference and Exhibition (DATE), 2006
RAMP: research accelerator for multiple processors - a community vision for a shared experimental parallel HW/SW platform
IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2006
The Common Case Transactional Behavior of Multithreaded Programs
International Symposium on High-Performance Computer Architecture (HPCA), 2006
Building and using the atlas transactional memory system
Workshop on Architecture Research with FPGAs (WARP), 2006
2005
Automatic power management schemes for Internet servers and data centers
IEEE Global Telecommunications Conference (GLOBECOM), 2005
Characterization of TCC on chip-multiprocessors
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2005
Improving instruction delivery with a block-aware ISA
European Conference on Parallel Processing (Euro-Par), 2005
Energy-efficient and high-performance instruction fetch using a block-aware ISA
International Symposium on Low Power Electronics and Design (ISPLD), 2005
TAPE: A transactional application profiling environment
International Conference on Supercomputing (ICS), 2005
Heuristics for Profile-Driven Method-Level Speculative Parallelization
International Conference on Parallel Processing (ICPP), 2005
ATLAS: A Scalable Emulator for Transactional Parallel Systems
Workshop on Architecture Research with FPGAs (WARP), 2005
2004
Programming with transactional coherence and consistency (TCC)
ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2004
The stream virtual machine
International Conference on Parallel Architectures and Compilation Techniques (PACT), 2004
Viram1: A media-oriented vector processor with embedded dram
Design Automation Student Design Contest, 2004
Transactional memory coherence and consistency
International Symposium on Computer Architecture (ISCA), 2004
2003
Overcoming the limitations of conventional vector processors
International Symposium on Computer Architecture (ISCA), 2003
2002
Vector vs. superscalar and VLIW architectures for embedded multimedia benchmarks
International Symposium on Microarchitecture (MICRO), 2002
2001
Intelligent Memory Systems: Second International Workshop, IMS 2000, Cambridge, MA, USA, November 12, 2000. Revised Papers
Lecture Notes in Computer Science, 2001
· Springer
1999
1998
1997
Intelligent RAM (IRAM): the industrial setting, applications, and architectures
IEEE International Conference on Computer Design (ICCD), 1997
The energy efficiency of IRAM architectures
International Symposium on Computer Architecture (ISCA), 1997
Intelligent RAM (IRAM): chips that remember and compute
IEEE International Solid-State Circuits Conference (ISSCC), 1997
Pipelined multi-queue management in a VLSI ATM switch chip with credit-based flow-control
Advanced Research in VLSI, 1997
· ACM DL