Generative Adversarial Networks

Ian Goodfellow; Jean Pouget-Abadie; Mehdi Mirza; Bing Xu; David Warde-Farley; Sherjil Ozair; Aaron Courville; Yoshua Bengio

doi:10.48550/arxiv.1406.2661

arXiv (Cornell University)

June 10, 2014

Cited by 4,551. Open Access.

Abstract

Large language models (LLMs) rely on key-value (KV) caches to store attention context during autoregressive decoding. In long-sequence settings, the KV cache can consume large amounts of VRAM and become a practical bottleneck for throughput. We introduce KVHALO, an auxiliary reconstruction model that restores higher-fidelity KV tensors from a compressed cache state when required, reducing the persistent memory footprint during inference. In our evaluation, KVHALO achieves up to 91.85% directional cosine alignment at convergence and reduces long-context degradation relative to a low-bit baseline under our stress-test workloads. We use HRM rather than alternative architectures, which yields higher-quality results in only 18,600 steps.
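
To make the quantities in the abstract concrete, the following is a minimal sketch, not the KVHALO method. It assumes that "directional cosine alignment" means cosine similarity between an original KV vector and its restored counterpart, and it uses a plain uniform 4-bit quantization round trip as a stand-in for a low-bit baseline. The function names, the model dimensions, and the fp16 storage size are all hypothetical and chosen only for illustration.

```python
import math
import random


def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_value=2):
    """Estimate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_value=2 assumes fp16 storage (an assumption, not from the source).
    """
    return 2 * layers * heads * head_dim * seq_len * bytes_per_value


def quantize_dequantize(values, bits=4):
    """Uniform symmetric quantization to `bits` bits, then back to floats.

    This is an assumed stand-in for a low-bit cache, not the paper's baseline.
    """
    levels = 2 ** (bits - 1) - 1  # largest signed integer level, 7 for 4 bits
    peak = max(abs(v) for v in values)
    if peak == 0.0:
        return list(values)
    scale = peak / levels
    return [round(v / scale) * scale for v in values]


def cosine_alignment(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical model shape: 32 layers, 32 heads, head_dim 128, 4096 tokens.
cache_size = kv_cache_bytes(layers=32, heads=32, head_dim=128, seq_len=4096)
print(f"estimated KV cache: {cache_size / 2**30:.1f} GiB")

# Synthetic key vector; real KV tensors would come from a model's attention layers.
rng = random.Random(0)
key_vector = [rng.gauss(0.0, 1.0) for _ in range(128)]
restored = quantize_dequantize(key_vector, bits=4)
alignment = cosine_alignment(key_vector, restored)
print(f"cosine alignment after 4-bit round trip: {alignment:.4f}")
```

The memory estimate grows linearly with sequence length, which is why long contexts make the cache a bottleneck. The alignment figure printed here describes synthetic Gaussian data only and is not comparable to the 91.85% reported above.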
