Latent Briefing: Efficient Memory Sharing for Multi-Agent Systems via KV Cache Compaction
Friday, April 10, 2026 AI
Multi-agent systems have shown promise in coordination, complex reasoning, and parallel workflows, but they are often highly token-inefficient. In hierarchical architectures, where an orchestrator decomposes tasks and delegates them to worker agents, redundant intermediate reasoning can emerge. As the orchestrator’s reasoning trajectory expands across numerous calls, token usage compounds rapidly. These architectures can improve performance, but they do so at substantial cost and often share context between agents inefficiently.
Existing approaches to managing this context, such as LLM-based summarization (slow) or retrieval via RAG (brittle), introduce their own tradeoffs. Instead, we use the model’s attention patterns to identify which parts of the context are important and discard the rest at the representation level. This leads to a method for sharing relevant memory between agents by operating directly on the model’s KV cache. We refer to this approach as Latent Briefing.
Across 126 questions on the LongBench v2 benchmark (spanning documents from 0–100k tokens), our approach achieved:
Comparable or improved accuracy relative to the baseline across difficulty and context length conditions
Up to 49% median token savings on medium length (32k–100k token) documents
65% reduction in worker model token consumption
~1.7s median compaction overhead, scaling linearly with input length
Token Explosion in Recursive Agents
We adopted the Recursive Language Model (RLM) framework (Zhang et al., 2025) as our base architecture for multi-agent systems. In RLM, a strong orchestrator decomposes a task and makes repeated calls to a worker model through a REPL environment. The orchestrator sends targeted queries to the worker, asking it to analyze specific aspects of the document, verify hypotheses, or extract information.
While RLMs have shown strength in managing long contexts, they are less efficient than traditional LLMs and use significantly more tokens. Additionally, the worker only sees what the orchestrator explicitly passes it: typically a targeted query and the raw document. But the orchestrator has been building up a rich trajectory of reasoning across many calls: hypotheses tested, passages identified, dead ends eliminated, cross-references discovered. That accumulated context could help the worker answer more effectively, but passing it all as text inflates input costs with every successive call. The worker ends up working with a narrow view of the problem while the orchestrator's broader understanding sits unused.
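To make the cost of resending the trajectory concrete, here is a minimal arithmetic sketch. The function name and the figure of 1,500 trajectory tokens added per call are hypothetical, chosen only for illustration; they are not measurements from this work.

```python
def trajectory_overhead(step_tokens, n_calls):
    """Extra worker input tokens if the full text trajectory is resent on every call.

    Assumption (illustrative only): the trajectory grows by a fixed
    `step_tokens` after each worker call.
    """
    overhead = 0
    trajectory = 0
    for _ in range(n_calls):
        overhead += trajectory      # this call re-reads everything accumulated so far
        trajectory += step_tokens   # the orchestrator's reasoning grows after the call
    return overhead


for n in (10, 20, 40):
    print(n, trajectory_overhead(1_500, n))
```

Under these assumptions the overhead equals step_tokens × n(n − 1)/2, so doubling the number of calls roughly quadruples the extra input tokens.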
Standard solutions all have significant drawbacks:
LLM Summarization: 20–60s latency per step, lossy, and the summary may not capture what the subtask needs
RAG / Retrieval: Requires chunking and embedding, and misses cross-chunk dependencies
Pass everything: Expensive, slow, and accuracy can degrade with irrelevant context
We wanted fast, precise cross-agent memory that reduces this token explosion.
Task Guided KV Cache Compaction
Background: The AM Compaction Framework
Our approach builds on the Attention Matching (AM) framework for KV cache compaction (Zweiger et al., 2026). The core idea: given a KV cache of size S, find a compact cache of size t < S that produces nearly identical attention outputs.
Formally, for each attention head, we seek compacted components (C1, β, C2) such that:
\text{softmax}(Q \cdot C_1^T + \beta) \cdot C_2 \approx \text{softmax}(Q \cdot K^T) \cdot V
where:
C1 (compacted keys): a subset of the original key vectors selected for high attention
β (bias corrections): scalar adjustments that compensate for missing keys, ensuring the softmax distribution over kept keys approximates the original distribution over all keys
C2 (compacted values): reconstructed value vectors solved via ridge regression
The original AM algorithm processes each (layer, head) pair independently. For Qwen3-14B, that means 40 layers × 8 KV heads = 320 serialized solves, each running three steps:
Token selection: compute attention scores between all queries and all key positions, then select the top t positions with the highest aggregate score.
Beta via NNLS: find bias corrections β so that softmax(q · C1ᵀ + β) approximates softmax(q · Kᵀ) for the kept tokens. This is a non-negative least squares (NNLS) problem, solved via projected gradient descent with non-negativity constraints¹.
C2 via ridge regression: solve C2 = (XᵀX + λI)⁻¹XᵀY where X is the compacted softmax matrix and Y is the original attention output, reconstructing value vectors that preserve the attention computation.
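The three steps above can be sketched for a single attention head as follows. This is a dependency-free toy under stated assumptions, not the authors' implementation: all function names are hypothetical, the aggregate score is taken to be the sum of softmax weights over queries, and the β step uses one plausible NNLS formulation (weights w = exp(β) ≥ 0 chosen so the kept keys carry the full attention mass). A real system would run this on tensors per (layer, head) pair.

```python
import math
import random


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def transpose(A):
    return [list(c) for c in zip(*A)]


def solve(A, B):
    # Gauss-Jordan elimination with partial pivoting; solves A X = B.
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [x / piv for x in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]


def attn_weights(Q, keys, beta):
    # Row-wise softmax(Q · keysᵀ + beta).
    logits = matmul(Q, transpose(keys))
    return [softmax([l + b for l, b in zip(row, beta)]) for row in logits]


def compact_head(Q, K, V, t, lam=1e-3, steps=300):
    P = attn_weights(Q, K, [0.0] * len(K))   # original attention weights
    Y = matmul(P, V)                         # original attention outputs
    # Step 1: keep the t positions with the highest aggregate attention.
    agg = [sum(col) for col in zip(*P)]
    ranked = sorted(range(len(K)), key=lambda i: agg[i], reverse=True)
    keep = sorted(ranked[:t])
    C1 = [K[i] for i in keep]
    # Step 2 (assumed formulation): projected gradient descent on
    # min ||A w - 1||^2 subject to w >= 0, where A holds the kept columns of P.
    A = [[row[i] for i in keep] for row in P]
    eta = 1.0 / sum(a * a for row in A for a in row)   # step size 1 / ||A||_F^2
    w = [1.0] * t
    for _ in range(steps):
        resid = [sum(a * wj for a, wj in zip(row, w)) - 1.0 for row in A]
        grad = [sum(row[j] * r for row, r in zip(A, resid)) for j in range(t)]
        w = [max(0.0, wj - eta * g) for wj, g in zip(w, grad)]  # project onto w >= 0
    beta = [math.log(max(wj, 1e-12)) for wj in w]
    # Step 3: ridge regression C2 = (XᵀX + λI)⁻¹ XᵀY.
    X = attn_weights(Q, C1, beta)
    Xt = transpose(X)
    G = matmul(Xt, X)
    for j in range(t):
        G[j][j] += lam
    C2 = solve(G, matmul(Xt, Y))
    return keep, beta, C2


def sq_err(A, B):
    return sum((a - b) ** 2 for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def rand_matrix(n, d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]


# Toy data: random vectors stand in for one head's queries, keys, and values.
random.seed(0)
d, S, n_q, t = 4, 12, 16, 4
Q, K, V = rand_matrix(n_q, d), rand_matrix(S, d), rand_matrix(S, d)
keep, beta, C2 = compact_head(Q, K, V, t)

Y = matmul(attn_weights(Q, K, [0.0] * S), V)
X = attn_weights(Q, [K[i] for i in keep], beta)
err_fitted = sq_err(matmul(X, C2), Y)                     # reconstructed values
err_copied = sq_err(matmul(X, [V[i] for i in keep]), Y)   # original values reused
print(keep, err_fitted, err_copied)
```

Comparing err_fitted with err_copied shows what the ridge step buys: the reconstructed values are fitted to the original attention outputs rather than simply copied from the kept positions.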
Our Modifications:
We made three key changes to adapt AM compaction for the inference setting:
1. Task guided query vectors. In the original AM framework, the queries used for scoring are sampled from the context itself. We replace these with queries derived from the orchestrator's task prompt for this specific worker call. This enables cache compression that prioritizes information most relevant to the worker task.
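A toy sketch of why the choice of scoring queries matters: the same keys yield different kept positions depending on which queries score them. The vectors, the two "topics", and the function name are hypothetical, and the aggregate score is assumed to be the sum of softmax weights over queries.

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    z = sum(exps)
    return [e / z for e in exps]


def top_t_positions(queries, keys, t):
    """Return the t key positions with the highest attention summed over queries."""
    agg = [0.0] * len(keys)
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        agg = [s + w for s, w in zip(agg, weights)]
    ranked = sorted(range(len(keys)), key=lambda i: agg[i], reverse=True)
    return sorted(ranked[:t])


# Toy cache: positions 0-2 concern topic A (first axis), 3-5 topic B (second axis).
keys = [[3.0, 0.0], [2.5, 0.2], [2.8, 0.1], [0.0, 3.0], [0.1, 2.7], [0.2, 2.9]]

task_queries = [[0.0, 1.0], [0.1, 0.9]]      # hypothetical: the worker task is about topic B
context_queries = [[1.0, 0.0], [0.9, 0.1]]   # hypothetical: queries sampled from topic A text

print(top_t_positions(task_queries, keys, 3))      # [3, 4, 5]
print(top_t_positions(context_queries, keys, 3))   # [0, 1, 2]
```

With task-derived queries, the compaction budget goes to the positions the worker's task actually attends to, rather than to whatever dominates the context.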
The trajectory here is the orchestrator's full con