Technology

Discussing the infrastructure requirements of Agentic AI.

The rise of Agentic AI places unprecedented demands on our infrastructure. This article explores the emerging software and hardware requirements, from specialized runtimes and memory services to zero-trust security models, dissecting AWS's new Bedrock AgentCore platform and discussing the future of AI infrastructure.
Noll
13 min read
#Technology #AI #Innovation



I’ve been deep in the trenches with Agentic AI applications lately, and a recurring theme is the sheer demand these systems place on infrastructure. As I was grappling with these challenges in my own development work, AWS conveniently dropped a suite of new products at its 2025 New York City Summit[1], including Amazon Bedrock AgentCore and Amazon S3 Vectors (vector buckets). The announcement came just two weeks after AWS unveiled its custom GB200 instances, and that timing is no coincidence.

See: "A Look at the AWS GB200 Instance"

In fact, the launch of Bedrock AgentCore might just be the missing piece of the puzzle, explaining why AWS is so keen on merging the front-end CPU network with the GPU's scale-out fabric.

But the infrastructure demands of Agentic AI go far beyond networking. We're talking about specialized virtualized environments for agent execution, sophisticated in-memory storage for Context Engineering, secure sandboxes for code-generating models, and the monumental task of coordinating Multi-Agent Systems.

This article will break down these emerging software and hardware requirements and explore the potential evolution of the infrastructure that will power the next generation of AI.

AWS Unveils Its End-to-End Agentic Infra: Bedrock AgentCore

The AWS keynote in New York kicked off by addressing a pain point every developer in this space knows well: deploying AI agents into production is hard. Really hard.

A slide showing challenges in building generative AI agents, including security, access control, tool discovery, and observability.

The core challenges aren't just about model performance; they're about the nitty-gritty operational details: security isolation, authentication, access control, observability, reliable code execution, and enabling agents to discover and use tools via protocols such as MCP (Model Context Protocol) and A2A (Agent-to-Agent).

To tackle this complex web of problems, AWS introduced Bedrock AgentCore, a platform built on seven core components.

A diagram showing the seven core components of Bedrock AgentCore: Agent Runtime, Memory Service, Tool Integration, Code Interpreter, Browser, Tool Gateway, and Observability.

This isn't just a simple extension of traditional cloud services or a copy-paste of old paradigms like FaaS or microservices. Bedrock AgentCore represents a fundamental infrastructure shift, purpose-built for the unique demands of Agentic Software. Let's break it down.

Agent Runtime

At its heart is a secure, serverless runtime environment designed specifically for agents. It uses lightweight MicroVMs to provide robust, session-level security isolation, complete with built-in identity authentication and fast cold starts. The runtime is framework-agnostic: you can bring your own framework, such as CrewAI or LangGraph (or AWS's own Strands Agents), connect to any open-source tool, and use any model to dynamically expand an agent's capabilities.

A slide detailing the Agent Runtime, highlighting its serverless, secure, and flexible nature, with support for various frameworks and tools.

What's fascinating here is the use of MicroVMs to handle complex, asynchronous workloads that can last up to eight hours. This is a world away from traditional stateless serverless functions. Agent execution is inherently stateful; it needs to maintain a significant amount of context. It might even require execution checkpoints, allowing it to roll back to a previous snapshot and retry a step if something goes wrong.
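To make that statefulness concrete, here is a minimal sketch of session-level checkpoint-and-retry logic. The class and method names are hypothetical illustrations of the idea, not the Bedrock AgentCore API.

```python
import copy
import uuid


class AgentSession:
    """Hypothetical stateful agent session with snapshot/rollback.

    Illustrates the idea only; this is not the Bedrock AgentCore API.
    """

    def __init__(self):
        self.session_id = str(uuid.uuid4())
        self.context = {"messages": [], "tool_results": []}
        self._checkpoints = {}  # checkpoint_id -> deep copy of the context

    def checkpoint(self) -> str:
        """Snapshot the current context before a risky step."""
        cp_id = str(uuid.uuid4())
        self._checkpoints[cp_id] = copy.deepcopy(self.context)
        return cp_id

    def rollback(self, cp_id: str) -> None:
        """Restore the context captured at a previous checkpoint."""
        self.context = copy.deepcopy(self._checkpoints[cp_id])

    def run_step(self, step_fn) -> None:
        """Run one agent step; retry from a snapshot if it fails."""
        cp = self.checkpoint()
        try:
            step_fn(self.context)
        except Exception:
            self.rollback(cp)
            raise
```

In a long-running, eight-hour workload, a wrapper like this is what lets a single failed tool call be retried without replaying the whole session.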

Furthermore, agents often operate with specific identities and permissions to handle sensitive data, making the security of their execution environment paramount. Traditional containers, even those used in some AWS Lambda services, simply don't offer this level of hardened isolation.

Memory Service

This component manages an agent's short-term and long-term memory, providing the crucial context the model needs to reason and act. It leverages technologies like vector databases and the newly announced Amazon S3 Vectors storage.

A slide explaining the Memory Service, distinguishing between short-term memory for conversational context and long-term memory for persistent knowledge.

AWS has created a powerful abstraction here. Short-term memory is for the here and now, tracking contextual information in real-time during a conversation, like raw interaction events. Long-term memory, on the other hand, stores persistent knowledge across sessions, such as user preferences, conversation summaries, and semantic facts.
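A minimal sketch of what this short-term/long-term split can look like in code follows. The classes are illustrative only and are not the AgentCore Memory API; in practice the long-term store would sit behind a vector database or S3 Vectors rather than an in-process dict.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ShortTermMemory:
    """Raw interaction events for the current session (the 'here and now')."""
    events: list[dict[str, Any]] = field(default_factory=list)

    def record(self, role: str, content: str) -> None:
        self.events.append({"role": role, "content": content})

    def window(self, n: int = 20) -> list[dict[str, Any]]:
        """Return only the most recent events to keep the prompt small."""
        return self.events[-n:]


@dataclass
class LongTermMemory:
    """Persistent knowledge across sessions: preferences, summaries, facts."""
    facts: dict[str, str] = field(default_factory=dict)

    def remember(self, key: str, value: str) -> None:
        self.facts[key] = value

    def recall(self, key: str) -> str | None:
        return self.facts.get(key)


# Usage: consolidate a finished session into long-term memory.
stm = ShortTermMemory()
ltm = LongTermMemory()
stm.record("user", "Prefer conservative, dividend-focused strategies.")
ltm.remember("user_preference", "conservative, dividend-focused")
```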

Tool Integration

This enables AI agents to seamlessly and securely connect with AWS services and third-party tools like GitHub, Salesforce, and Slack. By leveraging user identity or pre-authorized consent, agents can use secure credentials to safely access external resources.

A slide on Tool Integration, showing how agents can securely connect to various enterprise systems and SaaS applications.

Code Interpreter

This provides a sandboxed, isolated code execution environment for agents to perform computations, data analysis, or run generated scripts safely.

A slide describing the Code Interpreter as a secure, sandboxed environment for running agent-generated code.
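As a rough illustration of the sandboxing intent (not the AgentCore Code Interpreter API), the sketch below runs model-generated Python in a separate, isolated process with a hard timeout and an empty environment. A production interpreter adds MicroVM or container isolation, resource limits, and network restrictions on top.

```python
import subprocess
import sys
import tempfile


def run_untrusted(code: str, timeout_s: int = 10) -> str:
    """Run model-generated Python in a separate process with a hard timeout.

    Only a rough illustration of sandboxing intent; a real code interpreter
    would add VM/container isolation, resource limits, and no network access.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        [sys.executable, "-I", path],  # -I: isolated mode, ignore user site/env
        capture_output=True,
        text=True,
        timeout=timeout_s,
        env={},                        # empty environment for the child process
    )
    return result.stdout if result.returncode == 0 else result.stderr


print(run_untrusted("print(sum(range(10)))"))  # -> 45
```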

Browser

This is a model-agnostic browser tool that gives agents the ability to interact with the web. It operates in a secure, sandboxed environment with full VM-level isolation, comprehensive auditing, and session-level privacy.

A slide detailing the Browser tool, emphasizing its security features like VM-level sandboxing and session-level isolation.

Tool Gateway

The Gateway offers a simple and secure way for developers to build, deploy, discover, and connect to external tools at scale. One of its most powerful features is the ability to automatically convert existing APIs (RESTful or GraphQL), Lambda functions, and other services into MCP-compatible tools that agents can discover and call.

A slide explaining the Tool Gateway, which simplifies the process of making external tools and APIs available to agents.
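The sketch below shows what "converting an existing API into an agent-compatible tool" amounts to: deriving an MCP-style tool definition (name, description, JSON-schema input) from an endpoint description. The function name, the example endpoint, and the `x-endpoint` metadata field are hypothetical; the Gateway automates this step.

```python
from typing import Any


def rest_endpoint_to_tool_spec(
    name: str,
    method: str,
    url: str,
    description: str,
    params: dict[str, str],
) -> dict[str, Any]:
    """Describe an existing REST endpoint as a JSON-schema tool definition,
    roughly the shape MCP-style tool registries expect. Illustrative only."""
    return {
        "name": name,
        "description": description,
        "inputSchema": {
            "type": "object",
            "properties": {p: {"type": t} for p, t in params.items()},
            "required": list(params),
        },
        # Metadata a gateway could use to proxy the actual HTTP call.
        "x-endpoint": {"method": method, "url": url},
    }


tool = rest_endpoint_to_tool_spec(
    name="get_ticket",
    method="GET",
    url="https://example.com/api/tickets/{id}",   # hypothetical endpoint
    description="Fetch a support ticket by id.",
    params={"id": "string"},
)
print(tool["inputSchema"])
```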

Observability

You can't fix what you can't see. This component provides deep observability into agent execution, allowing developers to trace, debug, and monitor agent performance in production. It supports standard formats like OpenTelemetry and integrates tightly with AWS's own CloudWatch service.

A slide on Observability, showing how developers can trace, debug, and monitor agent performance using integrated tools like CloudWatch.
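Since the component speaks OpenTelemetry, instrumenting an agent is largely a matter of wrapping each step in a span. Below is a minimal sketch using the standard opentelemetry-sdk; it exports to the console for illustration, whereas a production setup would export OTLP to a collector feeding CloudWatch. The span names and attributes are my own conventions, not a prescribed schema.

```python
# Requires: pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; in production you would export OTLP
# to your collector (for example, one that feeds CloudWatch).
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")


def plan_and_act(task: str) -> None:
    with tracer.start_as_current_span("agent.task") as span:
        span.set_attribute("agent.task", task)
        with tracer.start_as_current_span("agent.plan"):
            pass  # call the model to produce a plan
        with tracer.start_as_current_span("agent.tool_call"):
            pass  # execute a tool and record its result


plan_and_act("summarize yesterday's market moves")
```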

Drilling Down: The Runtime Environment

Two aspects of the Agent Runtime deserve a closer look.

First is session-level isolation. Traditional runc-based containers share the host kernel and have a history of container-escape vulnerabilities, which is why AWS is wisely adopting MicroVM technology like Firecracker for this scenario. But it's not just about isolating one session from another; multi-tenant isolation is also critical. When sensitive applications and data are deployed within a VPC, the runtime's MicroVMs must also reside inside that VPC, and secure data access might even necessitate a zero-trust framework.

Second, for "Computer-Use" agents that need to interact with desktop applications, the runtime must support virtual desktops and common GUI functions. Similarly, to break down data silos within mobile apps, "Mobile-Use" scenarios are becoming increasingly important.

All operations and calls made to these VMs over MCP (Model Context Protocol) need corresponding observability logs. A relatively mature solution in this space is Alibaba Cloud's Wuying AgentBay. It uses MCP to fully support a range of use cases, including Windows/Linux Computer-Use, Browser-Use, Code Interpreter, and Mobile-Use. It can spin up VMs on a per-session basis, and their GUIs can be observed in real time through a browser. By leveraging cloud virtualization and massive compute pools, it can even run multiple VMs simultaneously to tackle complex tasks.

A diagram showing the architecture of Alibaba Cloud's Wuying AgentBay, which supports various use cases like Computer-Use, Browser-Use, and Mobile-Use.

AgentBay also allows for custom images to support a wider variety of applications and can be deployed within a specified VPC. In essence, the AgentBay platform provides a superset of the Browser and Code Interpreter components in Bedrock AgentCore. For example, we can pair the Kimi K2 model with the AgentBay Code MCP to execute a task for retrieving and analyzing stock market data.

A screenshot showing a code execution task within the AgentBay platform, with Kimi model generating code to analyze stock data.

A screenshot showing the output of the stock analysis task, displaying a K-line chart within the AgentBay interface.

Platforms like AgentBay are also incredibly valuable for training large model-based agents. Particularly during Reinforcement Learning (RL), the ability to use VM snapshots makes it trivial to perform an "Environment Rollback"—reverting to a previous state to explore a different action path. This is exactly the kind of simulation-based training Elon Musk is focused on.

A screenshot of a tweet from Elon Musk about training AI in a simulated environment.

The Memory Service: The Hardest Problem to Solve

The memory layer is arguably the most challenging piece of the Agentic AI infrastructure puzzle. On one hand, the application itself demands extensive Context Engineering. On the other, you have the staggering complexity of managing context across multi-turn conversations, especially in Multi-Agent scenarios. AWS's abstraction of Short-Term and Long-Term Memory is definitely the right way to think about it.

A diagram illustrating the concept of short-term and long-term memory for AI agents.

For example, in a stock market quantitative trading agent I was building, real-time market data, company announcements, and forum comments constituted the Short-Term Memory. The insights derived from that data, such as K-line chart analysis and sentiment analysis, became Long-Term Memory. In the future, user preferences and behaviors will also live in Long-Term Memory, functioning much like an embedding table in a modern recommendation system.

This Short-Term/Long-Term structure effectively creates a hot/cold tiering for storage. This means object storage like S3 needs a hot cache layer and powerful search capabilities, like those offered by the new S3 Vectors. But more importantly, when multiple agents are executing in parallel, you run headfirst into massive data consistency challenges. I've touched on this before, and it's a problem that has deep roots in distributed systems.

See: "Large Model Inference Systems (Part 1) -- Starting with the DEC VAXcluster Distributed System"

This is the very problem that led to technologies like Oracle RAC Cache Fusion and IBM Db2 pureScale. In the cloud era, we see it addressed in three-tiered disaggregated architectures like Alibaba Cloud's PolarDB Serverless.

However, a critical question remains: how do you implement multi-tenant isolation and compute-storage disaggregation for the Memory Service? This applies not only to user-facing text data but also to the enormous amount of KV Cache generated during model inference. This inevitably leads to the problem of building a shared memory pool. Do you use an independent, multi-tenant memory pool built on something like a CXL Switch? Unfortunately, no matter how capable CXL becomes, it's not on the GPU Scale-Up roadmap.

This dilemma points to two potential paths forward:

  1. The GPU-Centric Path: From the GPU's perspective, the ideal is direct Load/Store (LD/ST) semantics, eliminating the latency and unpredictability of KV Cache transfers that waste precious GPU cycles. AWS's current approach with the GB200 appears to be a VPC-supported Scale-Out solution: data is transferred first to the Grace CPU and then to the Blackwell GPU via the high-speed NVLink C2C interconnect. (A diagram showing data transfer from a Grace CPU to a Blackwell GPU via NVLink C2C within a VPC.) However, this transfer path requires robust congestion control and priority scheduling to avoid interfering with other critical network traffic; a rough sizing sketch follows this list.

  2. The CPU-Centric Path: On the flip side, the Grace CPU also needs multi-tenant isolation capabilities. A more secure approach would be to store the KV Cache in a user-controlled Runtime (like the Agent Runtime's MicroVM). This way, the cloud provider's inference platform caches as little user data as possible, creating a trade-off between performance and isolation.
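To see why KV Cache movement dominates this discussion, a back-of-envelope calculation helps. All model and link parameters below are assumptions chosen for illustration, not measurements of GB200 or of any specific model.

```python
# Back-of-envelope KV cache sizing and transfer time. All numbers are
# assumptions for illustration, not measurements of any specific system.

layers = 80            # transformer layers (assumed)
kv_heads = 8           # grouped-query KV heads (assumed)
head_dim = 128         # dimension per head (assumed)
bytes_per_value = 2    # FP16/BF16
context_tokens = 128_000

# Per token: 2 (K and V) * layers * kv_heads * head_dim * bytes
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
total_gb = kv_bytes_per_token * context_tokens / 1e9
print(f"KV cache per token: {kv_bytes_per_token / 1024:.1f} KiB")
print(f"KV cache for {context_tokens:,} tokens: {total_gb:.1f} GB")

# Time to move that cache over an assumed 900 GB/s chip-to-chip link
# versus an assumed 100 GB/s (800 Gbps) scale-out NIC.
for name, gb_per_s in [("C2C-class link", 900), ("scale-out NIC", 100)]:
    print(f"Transfer over {name}: {total_gb / gb_per_s * 1000:.1f} ms")
```

Under these assumptions the cache for a single long-context session runs into the tens of gigabytes, which is why repeated transfers, rather than one-time placement, are what the congestion-control and isolation questions above are really about.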

Following this logic, a new hardware requirement emerges: x86 processors from AMD and Intel need to support UALink. For a chiplet-based architecture like AMD's, could they re-tape-out the I/O die to swap some PCIe lanes for UALink while leaving the core compute dies unchanged? The Turin generation, with its high core counts and large memory capacity, is already well-suited to hosting numerous session-based VMs for tenants. UALink is particularly well-designed for this, featuring a built-in MMU for multi-tenant isolation, a more complete solution than Broadcom's SUE (Scale-Up Ethernet).

If AMD and Intel were to pioneer UALink support on CPUs, it would be a massive win for the entire ecosystem. Nearly all non-NVIDIA GPU vendors would have a standard to rally around, much like AGP and PCIe did for graphics cards in the past. Debugging UALink interconnects on CPU nodes would be simpler, and third-party memory vendors like Samsung would have a clear roadmap beyond CXL. It could even enable the creation of pure memory nodes within the Scale-Up domain.

Security: A Zero Trust Approach

When you peel back the layers, the security model required for Agentic AI is effectively Zero Trust Network Access (ZTNA). The principles of the Software-Defined Perimeter (SDP) provide an excellent blueprint that can be applied to the Agent Runtime and the backend inference platform. In a traditional SDP deployment, clients connect securely to protected resources. For Agentic AI, the Agent Runtimes act as the SDP clients, while the large model inference platform and an enterprise's internal applications are the resources, interconnected via secure application gateways.

A diagram of a traditional Software-Defined Perimeter (SDP) architecture for enterprise networks.

This part of the puzzle can be solved by directly adopting the battle-tested ZTNA SDP model.
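As a minimal sketch of the per-request check such an SDP application gateway would perform: authenticate the calling Agent Runtime on every request and authorize it per resource, granting nothing based on network location. The shared key, policy table, and identifiers below are simplified stand-ins, not any particular product's mechanism.

```python
import hashlib
import hmac

# Hypothetical per-request check at an SDP application gateway:
# every call from an Agent Runtime must present a signed identity token,
# and authorization is evaluated per resource, never by network location.

SHARED_KEY = b"demo-key"  # stand-in; in practice, per-tenant keys from a KMS/IdP

ALLOWED = {
    "agent-runtime-42": {"inference-api", "crm-tickets"},  # assumed policy table
}


def sign(agent_id: str) -> str:
    return hmac.new(SHARED_KEY, agent_id.encode(), hashlib.sha256).hexdigest()


def authorize(agent_id: str, token: str, resource: str) -> bool:
    """Authenticate the caller, then check per-resource policy."""
    if not hmac.compare_digest(sign(agent_id), token):
        return False                      # authentication failed
    return resource in ALLOWED.get(agent_id, set())


# Usage: a gateway deciding whether to proxy a tool call.
token = sign("agent-runtime-42")
print(authorize("agent-runtime-42", token, "crm-tickets"))   # True
print(authorize("agent-runtime-42", token, "payroll-db"))    # False
```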

Context Engineering & The Demands on Models

I plan to dive deeper into the specifics of Context Engineering in my 'Agent101' series, but for now, the work from Manus[2] perfectly illustrates the core challenge:

A diagram from Manus illustrating the challenges of managing growing context and KV Cache in AI agents.

Manus has done extensive work to improve the KV Cache hit rate, but it's incredibly difficult to escape the fundamental problem of a sequential context stack that just keeps growing. Eventually, it blows past the model's context length limit. For instance, when I was building the stock trading agent, I hit a wall: how do you analyze hundreds of stocks in the S&P 500 simultaneously while maintaining a high cache hit rate and not overwhelming the context window?

The fundamental problem is analogous to how early processors and operating systems evolved to use virtual memory and paging. We're seeing the first hints of a solution in techniques like DeepSeek's Native Sparse Attention (NSA).

A diagram explaining DeepSeek's Native Sparse Attention (NSA), which uses a section-based approach to manage long contexts.

This idea of "paging" context based on 'sections' could be a powerful solution, especially since in many tasks, the data at the boundaries of these sections has no direct attention correlation.
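Here is a small sketch of what section-based context "paging" could look like at the application layer: active sections stay in the prompt verbatim, inactive ones are swapped for short summaries. This is my illustration of the virtual-memory analogy, not DeepSeek's NSA mechanism, and the token accounting is deliberately crude.

```python
from dataclasses import dataclass


@dataclass
class Section:
    title: str
    text: str
    summary: str  # short stand-in used when the section is paged out


def page_context(sections: list[Section], active: set[str],
                 token_budget: int) -> str:
    """Build a prompt that keeps active sections verbatim and replaces
    inactive ones with summaries, roughly like paging out cold memory.
    Token cost is approximated by whitespace word count."""
    parts: list[str] = []
    used = 0
    for s in sections:
        body = s.text if s.title in active else s.summary
        cost = len(body.split())
        if used + cost > token_budget:
            body = s.summary          # fall back to the summary if over budget
            cost = len(body.split())
        parts.append(f"{s.title}\n{body}")
        used += cost
    return "\n\n".join(parts)


sections = [
    Section("AAPL", "full price history and filings ...", "AAPL: mildly bullish."),
    Section("MSFT", "full price history and filings ...", "MSFT: neutral."),
]
print(page_context(sections, active={"AAPL"}, token_budget=500))
```

A scheme like this only pays off when, as noted above, the sections have little cross-boundary attention, which is exactly the property the hundreds-of-stocks case tends to have.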

This also makes you wonder: what kind of advanced context and caching mechanisms are being used by the models from OpenAI and DeepMind that are winning IMO gold medals and can "think" for hours on a single problem?

These, I believe, are the next frontiers for the models themselves.

Conclusion

From MicroVMs to UALink, the rise of Agentic AI is forcing a fundamental rethink of our entire infrastructure stack. In this article, I've outlined the key demands as I see them, exploring potential shifts in model architecture (Block/Section-based Attention), hardware (x86 support for UALink), agent runtimes (MicroVMs), security (Zero Trust), and the critical memory layer. I offer these thoughts as a starting point for a broader conversation and hope they are a useful reference for my peers across the industry.


Footnotes

  1. AWS Summit New York City 2025: https://www.youtube.com/watch?v=2890bEb61qQ

  2. Manus Context Engineering: https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus
