Agentic AI scaling requires a new memory architecture.
Agentic AI represents a fundamental shift from **stateless chatbots** toward **long-running, goal-driven systems** capable of reasoning, planning, using tools, and retaining memory across sessions. As these systems grow more capable, they also expose a critical infrastructure bottleneck: **memory**.

While foundation models continue to scale toward trillions of parameters and context windows stretch into millions of tokens, the cost of remembering history is now growing faster than the ability to process it. This imbalance is forcing enterprises to rethink how AI memory is stored, moved, and powered.
The Memory Bottleneck Holding Back Agentic AI
At the heart of the issue is long-term inference memory, technically known as the Key-Value (KV) cache. Transformer-based models store previous token states in this cache to avoid recomputing the entire conversation history every time a new token is generated.
In agentic workflows, this KV cache becomes far more than short-term context:
- It persists across tool calls and tasks
- It grows linearly with sequence length
- It acts as the “working memory” of intelligent agents
The result is a new class of data that is latency-critical but ephemeral, and existing hardware architectures are not designed to handle it efficiently.
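To make that growth concrete, the sketch below estimates KV-cache size as a function of sequence length. The model dimensions (layer count, KV heads, head size, 16-bit values) are illustrative assumptions rather than the specification of any particular model.

```python
# Back-of-envelope KV-cache sizing. All model dimensions here are
# illustrative assumptions, not figures for any specific model.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 80,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    Each layer stores one key and one value vector per KV head for
    every token, so the cache grows linearly with sequence length.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> ~{gib:6.1f} GiB of KV cache per sequence")
```

Even under these modest assumptions, a single million-token agent session needs hundreds of gigabytes of cache before any new computation happens, and that is per sequence, multiplied across every concurrent agent.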
Why Current Infrastructure Falls Short
Today’s systems force organisations into a costly binary choice:
Option 1: GPU High-Bandwidth Memory (HBM)
- Extremely fast
- Extremely expensive
- Limited capacity
Using HBM to store massive context windows quickly becomes cost-prohibitive.
Option 2: General-Purpose Storage
- Cheap and scalable
- High latency
- Designed for durability, not speed
When inference context spills from GPU memory (G1) to system RAM (G2), and eventually to shared storage (G4), performance collapses. GPUs sit idle while waiting for memory to load, increasing latency, power consumption, and total cost of ownership (TCO).
This inefficiency makes real-time agentic interactions impractical at scale.
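A rough way to see why the spill hurts: the stall a GPU incurs per decode step depends on where its context blocks happen to live. The latency figures below are assumptions chosen only to illustrate the order-of-magnitude gap between tiers, not measured values.

```python
# Illustrative stall model for the tiering choice above. Latencies are
# assumed order-of-magnitude values, not benchmarks.

TIER_READ_LATENCY_US = {
    "G1_hbm": 1,                  # on-package GPU memory
    "G2_host_memory": 10,         # host DRAM over the host link
    "G4_shared_storage": 10_000,  # general-purpose networked storage
}

def decode_stall_us(blocks_needed: int, tier: str) -> int:
    """Time the GPU spends waiting for context blocks served from one tier."""
    return blocks_needed * TIER_READ_LATENCY_US[tier]

# The same 1,000 KV blocks, resident in HBM versus spilled to shared storage:
print(decode_stall_us(1_000, "G1_hbm"), "us of stall from HBM")
print(decode_stall_us(1_000, "G4_shared_storage"), "us of stall from shared storage")
```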
NVIDIA Introduces ICMS: A New Memory Tier
To address this growing disparity, NVIDIA has introduced Inference Context Memory Storage (ICMS) as part of its Rubin architecture. ICMS creates a new, purpose-built memory tier designed specifically for AI inference context.
Jensen Huang describes the shift clearly:
“AI is no longer about one-shot chatbots but intelligent collaborators that understand the physical world, reason over long horizons, use tools, and retain both short- and long-term memory.”
Introducing the “G3.5” Memory Tier
ICMS effectively inserts a new layer into the memory hierarchy:
- G1: GPU HBM
- G2: Host system memory
- G3.5: Ethernet-attached flash for inference context
- G4: Traditional shared storage
This G3.5 tier is optimised for high-velocity, short-lived AI memory, offering:
- Much lower latency than traditional storage
- Far lower cost than GPU HBM
- Petabytes of shared capacity per compute pod
By placing inference context closer to compute, ICMS allows models to retrieve memory quickly without occupying scarce GPU resources.
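One way to picture the new tier is as an extra branch in the placement decision for each KV block. The recency and capacity thresholds below are arbitrary assumptions for illustration; the point is that older-but-live context now has somewhere to go other than durable shared storage.

```python
# Placement sketch with a G3.5 context tier in the hierarchy. The
# recency and capacity thresholds are assumptions, not published specs.

def choose_tier(steps_since_last_use: int,
                block_bytes: int,
                hbm_free_bytes: int) -> str:
    """Decide which tier should hold a KV block right now."""
    if steps_since_last_use < 4 and block_bytes <= hbm_free_bytes:
        return "G1_hbm"              # hot context stays next to compute
    if steps_since_last_use < 64:
        return "G2_host_memory"      # warm context in host DRAM
    return "G3.5_context_flash"      # cold-but-live context parks on flash;
                                     # durable G4 storage stays off the decode path
```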
Performance and Energy Gains
The benefits of this architecture are measurable:
- Up to 5× higher tokens-per-second (TPS) for long-context workloads
- 5× better power efficiency compared to general-purpose storage
- Reduced GPU idle time through memory “pre-staging”
- Lower infrastructure overhead and improved TCO
By eliminating unnecessary durability guarantees and CPU-heavy metadata handling, ICMS delivers performance where it actually matters: real-time reasoning.
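At the application level, pre-staging amounts to overlapping the fetch of the next step's context with the current step's computation. The sketch below is an assumed illustration of that pattern, not NVIDIA's implementation; tier reads are simulated with a sleep.

```python
# Minimal pre-staging sketch: fetch the KV blocks for step i+1 while the
# GPU is busy with step i. Assumed illustration, not the vendor's code.
import time
from concurrent.futures import ThreadPoolExecutor

def load_kv_block(block_id: str) -> bytes:
    """Stand-in for reading one KV block from the context-memory tier."""
    time.sleep(0.01)          # simulate tier read latency
    return b"\x00" * 4096     # placeholder payload

def decode_step(resident_blocks: dict[str, bytes]) -> None:
    """Stand-in for one decode step that consumes already-resident blocks."""
    time.sleep(0.02)          # simulate GPU compute time

def run_with_prestaging(schedule: list[list[str]]) -> None:
    """Run decode steps while the next step's context loads in the background."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        pending = {b: pool.submit(load_kv_block, b) for b in schedule[0]}
        for i, needed in enumerate(schedule):
            resident = {b: pending[b].result() for b in needed}   # block only if late
            if i + 1 < len(schedule):                             # start next step's I/O now
                pending = {b: pool.submit(load_kv_block, b) for b in schedule[i + 1]}
            decode_step(resident)

run_with_prestaging([["blk-0", "blk-1"], ["blk-1", "blk-2"], ["blk-2", "blk-3"]])
```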
Integrating the Data Plane
ICMS relies on tight integration between compute, storage, and networking:
- NVIDIA BlueField-4 DPUs offload context management from CPUs
- Spectrum-X Ethernet delivers low-latency, high-bandwidth connectivity
- NVIDIA Dynamo and NIXL orchestrate KV movement between memory tiers
- DOCA framework treats inference context as a first-class resource
Together, these components allow flash storage to behave almost like an extension of system memory.
Industry Adoption Is Already Underway
Major infrastructure vendors are aligning with this new architecture, including:
- Dell Technologies
- HPE
- IBM
- Pure Storage
- Nutanix
- Supermicro
- VAST Data
- WEKA
- Hitachi Vantara
Platforms built around BlueField-4 are expected to reach the market in the second half of the year.
What This Means for Enterprises
Adopting a dedicated context memory tier forces organisations to rethink infrastructure strategy:
Reclassifying Data
KV cache is neither durable nor archival. It is ephemeral but latency-sensitive, and treating it like traditional data is inefficient.
Smarter Orchestration
Topology-aware scheduling ensures workloads run near their cached context, minimising data movement and network congestion.
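As a sketch of what topology-aware scheduling implies, the placement routine below (an assumed illustration, not any scheduler's actual API) prefers the node that already holds a session's cached context and only falls back to load balancing when no copy exists.

```python
# Topology-aware placement sketch: keep an agent session on the node that
# already holds its context. Assumed illustration, not a product API.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cached_sessions: set[str] = field(default_factory=set)
    active_sessions: int = 0

def place(session_id: str, nodes: list[Node]) -> Node:
    """Pick a node for a session, preferring KV-cache locality."""
    local = [n for n in nodes if session_id in n.cached_sessions]
    candidates = local or nodes                      # fall back to any node
    chosen = min(candidates, key=lambda n: n.active_sessions)
    chosen.cached_sessions.add(session_id)
    chosen.active_sessions += 1
    return chosen

pods = [Node("pod-a", {"agent-42"}), Node("pod-b")]
print(place("agent-42", pods).name)   # pod-a: its context is already local
print(place("agent-99", pods).name)   # pod-b: no cached copy anywhere, least loaded
```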
Datacenter Design
Higher compute density per rack improves efficiency but requires careful planning for cooling and power delivery.
Redefining Infrastructure for Agentic AI
Agentic AI breaks the old model of compute separated from slow, persistent storage. Systems with long-term memory demand fast, shared, and energy-efficient context access.
By introducing a specialised memory tier, organisations can:
- Decouple memory growth from GPU cost
- Serve multiple agents from a shared low-power memory pool
- Scale complex reasoning workloads sustainably
As enterprises plan their next AI investments, memory hierarchy design will be just as critical as GPU selection.
Frequently Asked Questions (FAQ)
What is agentic AI?
Agentic AI refers to systems that can plan, reason, use tools, and retain memory across long time horizons, rather than responding to single prompts.
What is a KV cache?
The Key-Value cache stores intermediate transformer states so models don’t need to recompute previous context for each new token.
Why is KV cache a problem at scale?
It grows linearly with context length and quickly overwhelms GPU memory, creating cost and latency bottlenecks.
What makes ICMS different from traditional storage?
ICMS is designed for speed, not durability. It removes unnecessary overhead like replication and metadata management.
What is the G3.5 memory tier?
It is an intermediate layer between system memory and shared storage, optimised for high-speed AI inference context.
Who benefits most from this architecture?
Enterprises running long-context AI agents, multi-step workflows, copilots, and real-time reasoning systems.
Is this replacing GPUs or HBM?
No. It complements GPUs by freeing expensive HBM for active computation instead of passive memory storage.
Why does this matter now?
As AI agents grow more capable, memory — not compute — is becoming the primary bottleneck for scaling intelligent systems.


