Contextra: Hierarchical Context Caching for Long Context Language Model Serving

Zhiqiang Xie Stanford

Ziyi Xu

Mark Zhao Stanford

Yuwei An

Vikram Sharma Mailthody

Scott Mahlke

Michael Garland

Christos Kozyrakis Stanford

USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2026


Abstract

Large Language Models (LLMs) with expanding context windows face significant performance hurdles. While caching key-value (KV) states is critical for avoiding redundant computation, the storage footprint of long-context caches quickly exceeds GPU memory capacity, forcing production systems to adopt hierarchical caching across the memory hierarchy. However, transferring large cached contexts back to the GPU introduces severe performance bottlenecks: fragmented I/O from paged layouts prevents full bandwidth utilization, and existing schedulers fail to account for cache-loading delays, leaving systems loading-bound rather than compute-bound. We present Contextra, a hierarchical context caching framework designed for efficient long-context LLM serving. Contextra introduces GPU-assisted I/O to combat KV cache fragmentation, decoupling GPU and CPU memory layouts, and employs cache-aware request scheduling to balance compute with I/O latency and to overlap unavoidable stalls with complementary tasks. Built on SGLang and deployed in production, Contextra achieves up to 5x lower Time-To-First-Token (TTFT) than vLLM + LMCache and a 3.75x speedup over NVIDIA TensorRT-LLM on long-context benchmarks, without degrading short-context performance.