Agentic AI scaling requires a new memory architecture.
Agentic AI represents a fundamental shift from **stateless chatbots** toward **long-running, goal-driven systems** capable of reasoning, planning, using tools, and retaining memory across sessions. As these systems grow more capable, they also expose a critical infrastructure bottleneck: **memory**.

While foundation models continue to scale toward trillions of parameters and context windows stretch into millions of tokens, the cost of remembering history is now growing faster than the ability to process it. This imbalance is forcing enterprises to rethink how AI memory is stored, moved, and powered.
The Memory Bottleneck Holding Back Agentic AI
At the heart of the issue is long-term inference memory, technically known as the Key-Value (KV) cache. Transformer-based models store previous token states in this cache to avoid recomputing the entire conversation history every time a new token is generated.
In agentic workflows, this KV cache becomes far more than short-term context:
- It persists across tool calls and tasks
- It grows linearly with sequence length
- It acts as the “working memory” of intelligent agents
The result is a new class of data that is latency-critical but ephemeral, and existing hardware architectures are not designed to handle it efficiently.
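To make that growth concrete, the sketch below estimates KV-cache size as a function of sequence length. The model dimensions (layer count, KV heads, head size, 16-bit values) are illustrative assumptions rather than the specification of any particular model.

```python
# Back-of-envelope KV-cache sizing. All model dimensions here are
# illustrative assumptions, not figures for any specific model.

def kv_cache_bytes(seq_len: int,
                   n_layers: int = 80,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence.

    Each layer stores one key and one value vector per KV head for
    every token, so the cache grows linearly with sequence length.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return seq_len * per_token

for tokens in (8_000, 128_000, 1_000_000):
    gib = kv_cache_bytes(tokens) / 2**30
    print(f"{tokens:>9,} tokens -> ~{gib:6.1f} GiB of KV cache per sequence")
```

Even under these modest assumptions, a single million-token agent session needs hundreds of gigabytes of cache before any new computation happens, and that is per sequence, multiplied across every concurrent agent.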
Why Current Infrastructure Falls Short
Today’s systems force organisations into a costly binary choice:
Option 1: GPU High-Bandwidth Memory (HBM)
- Extremely fast
- Extremely expensive
- Limited capacity
Using HBM to store massive context windows quickly becomes cost-prohibitive.
Option 2: General-Purpose Storage
- Cheap and scalable
- High latency
- Designed for durability, not speed
When inference context spills from GPU memory (G1) to system RAM (G2), and eventually to shared storage (G4), performance collapses. GPUs sit idle while waiting for memory to load, increasing latency, power consumption, and total cost of ownership (TCO).
This inefficiency makes real-time agentic interactions impractical at scale.
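A rough way to see why the spill hurts: the stall a GPU incurs per decode step depends on where its context blocks happen to live. The latency figures below are assumptions chosen only to illustrate the order-of-magnitude gap between tiers, not measured values.

```python
# Illustrative stall model for the tiering choice above. Latencies are
# assumed order-of-magnitude values, not benchmarks.

TIER_READ_LATENCY_US = {
    "G1_hbm": 1,                  # on-package GPU memory
    "G2_host_memory": 10,         # host DRAM over the host link
    "G4_shared_storage": 10_000,  # general-purpose networked storage
}

def decode_stall_us(blocks_needed: int, tier: str) -> int:
    """Time the GPU spends waiting for context blocks served from one tier."""
    return blocks_needed * TIER_READ_LATENCY_US[tier]

# The same 1,000 KV blocks, resident in HBM versus spilled to shared storage:
print(decode_stall_us(1_000, "G1_hbm"), "us of stall from HBM")
print(decode_stall_us(1_000, "G4_shared_storage"), "us of stall from shared storage")
```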
NVIDIA Introduces ICMS: A New Memory Tier
To address this growing disparity, NVIDIA has introduced Inference Context Memory Storage (ICMS) as part of its Rubin architecture. ICMS creates a new, purpose-built memory tier designed specifically for AI inference context.
Jensen Huang describes the shift clearly:
“AI is no longer about one-shot chatbots but intelligent collaborators that understand the physical world, reason over long horizons, use tools, and retain both short- and long-term memory.”
Introducing the “G3.5” Memory Tier
ICMS effectively inserts a new layer into the memory hierarchy:
- G1: GPU HBM
- G2: Host system memory
- G3.5: Ethernet-attached flash for inference context
- G4: Traditional shared storage
This G3.5 tier is optimised for high-velocity, short-lived AI memory, offering:
- Much lower latency than traditional storage
- Far lower cost than GPU HBM
- Petabytes of shared capacity per compute pod
By placing inference context closer to compute, ICMS allows models to retrieve memory quickly without occupying scarce GPU resources.
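One way to picture the new tier is as an extra branch in the placement decision for each KV block. The recency and capacity thresholds below are arbitrary assumptions for illustration; the point is that older-but-live context now has somewhere to go other than durable shared storage.

```python
# Placement sketch with a G3.5 context tier in the hierarchy. The
# recency and capacity thresholds are assumptions, not published specs.

def choose_tier(steps_since_last_use: int,
                block_bytes: int,
                hbm_free_bytes: int) -> str:
    """Decide which tier should hold a KV block right now."""
    if steps_since_last_use < 4 and block_bytes <= hbm_free_bytes:
        return "G1_hbm"              # hot context stays next to compute
    if steps_since_last_use < 64:
        return "G2_host_memory"      # warm context in host DRAM
    return "G3.5_context_flash"      # cold-but-live context parks on flash;
                                     # durable G4 storage stays off the decode path
```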
Performance and Energy Gains
The benefits of this architecture are measurable:
- Up to 5× higher tokens-per-second (TPS) for long-context workloads
- 5× better power efficiency compared to general-purpose storage
- Reduced GPU idle time through memory “pre-staging”
- Lower infrastructure overhead and improved TCO
By eliminating unnecessary durability guarantees and CPU-heavy metadata handling, ICMS delivers performance where it actually matters: real-time reasoning.
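At the application level, pre-staging amounts to overlapping the fetch of the next step's context with the current step's computation. The sketch below is an assumed illustration of that pattern, not NVIDIA's implementation; tier reads are simulated with a sleep.

```python
# Minimal pre-staging sketch: fetch the KV blocks for step i+1 while the
# GPU is busy with step i. Assumed illustration, not the vendor's code.
import time
from concurrent.futures import ThreadPoolExecutor

def load_kv_block(block_id: str) -> bytes:
    """Stand-in for reading one KV block from the context-memory tier."""
    time.sleep(0.01)          # simulate tier read latency
    return b"\x00" * 4096     # placeholder payload

def decode_step(resident_blocks: dict[str, bytes]) -> None:
    """Stand-in for one decode step that consumes already-resident blocks."""
    time.sleep(0.02)          # simulate GPU compute time

def run_with_prestaging(schedule: list[list[str]]) -> None:
    """Run decode steps while the next step's context loads in the background."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        pending = {b: pool.submit(load_kv_block, b) for b in schedule[0]}
        for i, needed in enumerate(schedule):
            resident = {b: pending[b].result() for b in needed}   # block only if late
            if i + 1 < len(schedule):                             # start next step's I/O now
                pending = {b: pool.submit(load_kv_block, b) for b in schedule[i + 1]}
            decode_step(resident)

run_with_prestaging([["blk-0", "blk-1"], ["blk-1", "blk-2"], ["blk-2", "blk-3"]])
```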
Integrating the Data Plane
ICMS relies on tight integration between compute, storage, and networking:
- NVIDIA BlueField-4 DPUs offload context management from CPUs
- Spectrum-X Ethernet delivers low-latency, high-bandwidth connectivity
- NVIDIA Dynamo and NIXL orchestrate KV movement between memory tiers
- DOCA framework treats inference context as a first-class resource
Together, these components allow flash storage to behave almost like an extension of system memory.
Industry Adoption Is Already Underway
Major infrastructure vendors are aligning with this new architecture, including:
- Dell Technologies
- HPE
- IBM
- Pure Storage
- Nutanix
- Supermicro
- VAST Data
- WEKA
- Hitachi Vantara
Platforms built around BlueField-4 are expected to reach the market in the second half of the year.
What This Means for Enterprises
Adopting a dedicated context memory tier forces organisations to rethink infrastructure strategy:
Reclassifying Data
KV cache is neither durable nor archival. It is ephemeral but latency-sensitive, and treating it like traditional data is inefficient.
Smarter Orchestration
Topology-aware scheduling ensures workloads run near their cached context, minimising data movement and network congestion.
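As a sketch of what topology-aware scheduling implies, the placement routine below (an assumed illustration, not any scheduler's actual API) prefers the node that already holds a session's cached context and only falls back to load balancing when no copy exists.

```python
# Topology-aware placement sketch: keep an agent session on the node that
# already holds its context. Assumed illustration, not a product API.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cached_sessions: set[str] = field(default_factory=set)
    active_sessions: int = 0

def place(session_id: str, nodes: list[Node]) -> Node:
    """Pick a node for a session, preferring KV-cache locality."""
    local = [n for n in nodes if session_id in n.cached_sessions]
    candidates = local or nodes                      # fall back to any node
    chosen = min(candidates, key=lambda n: n.active_sessions)
    chosen.cached_sessions.add(session_id)
    chosen.active_sessions += 1
    return chosen

pods = [Node("pod-a", {"agent-42"}), Node("pod-b")]
print(place("agent-42", pods).name)   # pod-a: its context is already local
print(place("agent-99", pods).name)   # pod-b: no cached copy anywhere, least loaded
```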
Datacenter Design
Higher compute density per rack improves efficiency but requires careful planning for cooling and power delivery.
Redefining Infrastructure for Agentic AI
Agentic AI breaks the old model of compute separated from slow, persistent storage. Systems with long-term memory demand fast, shared, and energy-efficient context access.
By introducing a specialised memory tier, organisations can:
- Decouple memory growth from GPU cost
- Serve multiple agents from a shared low-power memory pool
- Scale complex reasoning workloads sustainably
As enterprises plan their next AI investments, memory hierarchy design will be just as critical as GPU selection.
Frequently Asked Questions (FAQ)
What is agentic AI?
Agentic AI refers to systems that can plan, reason, use tools, and retain memory across long time horizons, rather than responding to single prompts.
What is a KV cache?
The Key-Value cache stores intermediate transformer states so models don’t need to recompute previous context for each new token.
Why is KV cache a problem at scale?
It grows linearly with context length and quickly overwhelms GPU memory, creating cost and latency bottlenecks.
What makes ICMS different from traditional storage?
ICMS is designed for speed, not durability. It removes unnecessary overhead like replication and metadata management.
What is the G3.5 memory tier?
It is an intermediate layer between system memory and shared storage, optimised for high-speed AI inference context.
Who benefits most from this architecture?
Enterprises running long-context AI agents, multi-step workflows, copilots, and real-time reasoning systems.
Is this replacing GPUs or HBM?
No. It complements GPUs by freeing expensive HBM for active computation instead of passive memory storage.
Why does this matter now?
As AI agents grow more capable, memory — not compute — is becoming the primary bottleneck for scaling intelligent systems.


