Managed AI Platforms

01 · Build vs buy 4 min

You can rent almost the entire AI stack now.

Five years ago, "use AI" meant buying GPUs, wrangling CUDA, and hiring people to keep a model server alive at 3 a.m. Today most teams call an API and ship. A managed AI platform is the service that makes that possible — and the first real decision is how much of the stack you actually want to own.

Managed AI platform — a cloud service that hosts models plus the plumbing around them (authentication, autoscaling, logging, safety filters, billing) behind a simple API. You rent a capability — "summarize this," "classify that" — instead of provisioning the hardware and software that produce it. The app you put on top is covered in Building LLM Apps; the clouds underneath are in Cloud Fundamentals.

What "managed" takes off your plate

GPUs & capacity — no fleet to buy, patch, or scale for spiky traffic.
Serving & uptime — load balancing, batching, and failover are the vendor's problem.
Model access — frontier models you could never train, one API call away.
The extras — safety filters, evals, RAG connectors, usage metering, SOC-2 paperwork.

Managed platforms collapse five layers of ops into one bill — you keep only the app and the prompts.

When the simpler option wins — and it often does

Most products should just call a hosted API. If you are not GPU-bound and not handling data you legally cannot send out, buying is faster, cheaper to start, and someone else carries the pager.
Self-host when the math or the rules demand it — very high, steady volume where per-token cost dominates, strict data residency, air-gapped environments, or deep model customization.
The rest of this deck is a map of the middle ground between "one API call" and "run your own GPUs."

02 · Foundation-model platforms 6 min

The big three: Bedrock · Vertex AI · Foundry.

The headline product from each hyperscaler is the same shape: one API in front of a catalog of foundation models, with governance, logging, and billing wired into the cloud you already use. They differ most in which models they front and which cloud they marry you to.

Foundation-model platform — a managed service that exposes many pre-trained models through one unified API, plus shared features like retrieval (RAG), safety guardrails, evaluation, and per-call metering. You do not pick a server; you pick a modelId and send text. The platform routes, scales, and bills.

The platform is a switchboard: your app speaks one API; the model behind modelId can change without a rewrite.

Read the schema

One contract. Same request shape whichever model answers — this is exactly the abstraction idea from OOP & Design, applied to vendors.
Shared services — RAG, guardrails, evals, and audit logs sit at the platform layer, not in your code.
Lock-in lives in the extras. Swapping modelId is easy; un-wiring a knowledge base or guardrail is the cost that keeps you.

    import boto3

    MODEL = "<your-model-id>"  # placeholder: any model ID enabled in your account

    client = boto3.client("bedrock-runtime")
    out = client.converse(
        modelId=MODEL,  # swap providers — same call
        messages=[{"role": "user", "content": [{"text": "Summarize Q3"}]}],
    )
    print(out["output"]["message"]["content"][0]["text"])

Bedrock's Converse API normalizes the request — change MODEL and the rest of your code is untouched.
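To make the switchboard idea concrete, here is a minimal sketch that wraps the Converse call so the model ID is the only thing a caller varies. The helper `ask`, the `ConverseClient` protocol, and the offline `EchoClient` are illustrative assumptions, not part of boto3; the request and response shapes simply mirror the snippet above.

```python
from typing import Any, Protocol


class ConverseClient(Protocol):
    """Anything with a Bedrock-style converse() method (hypothetical protocol)."""

    def converse(self, *, modelId: str, messages: list[dict[str, Any]]) -> dict[str, Any]: ...


def ask(client: ConverseClient, model_id: str, prompt: str) -> str:
    """Send one user turn and return the first text block of the reply."""
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]


class EchoClient:
    """Offline stand-in for boto3.client("bedrock-runtime"), for local testing only."""

    def converse(self, *, modelId: str, messages: list[dict[str, Any]]) -> dict[str, Any]:
        text = messages[0]["content"][0]["text"]
        reply = f"[{modelId}] {text}"
        return {"output": {"message": {"role": "assistant", "content": [{"text": reply}]}}}


if __name__ == "__main__":
    client = EchoClient()
    for model_id in ("model-a", "model-b"):  # placeholder IDs, not real catalog entries
        print(ask(client, model_id, "Summarize Q3"))
```

With a real `bedrock-runtime` client passed in place of `EchoClient`, the calling code would stay the same; only the model ID string would change.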

The tooling landscape

AWS Bedrock — the model marketplace for AWS shops

Serverless access to models from Anthropic (Claude), Meta, Mistral, Cohere, AI21, Stability, and Amazon's own Nova and Titan — plus Knowledge Bases (RAG), Guardrails, model evaluation, and Bedrock Agents, all behind IAM.

Claude · Llama · Mistral · Nova / Titan · Guardrails

Pro — widest third-party catalog (Claude included) with native AWS IAM, VPC, and billing; requests are served inside AWS rather than sent to each vendor's own API.
Con — model availability varies by region, and the API surface is more verbose than a plain OpenAI-style client.
Choose it when — you already run on AWS and want governance and data to stay inside it.

Google Vertex AI — models plus a full ML platform

Google Cloud's unified AI surface. Model Garden fronts Gemini alongside third-party models (Claude, Llama, Mistral), and the same platform also does custom training, pipelines, and the Agent Builder stack — more breadth than a pure model gateway.

Gemini · Claude · Llama · Model Garden · + training

Pro — Gemini's long context and multimodality are first-class, and model serving sits next to real ML tooling (covered in Part 3).
Con — the surface is large and the console can feel sprawling; you are squarely on Google Cloud.
Choose it when — you want Gemini, or you will both call and train models in one place.

Azure AI Foundry — the home of OpenAI on Azure

Microsoft's platform (formerly Azure AI Studio) gives enterprise access to OpenAI's GPT models via Azure OpenAI, plus a model catalog (Llama, Mistral, and more), an agent service, prompt flow, evaluations, and Content Safety — all under Azure identity and compliance.

GPT · Llama · Mistral · Content Safety · Agent Service

Pro — the enterprise path to GPT models with Azure's compliance story and tight Microsoft 365 / Entra integration.
Con — fewer non-OpenAI frontier options than Bedrock; quota and deployment management add steps.
Choose it when — you are a Microsoft shop or you specifically need GPT under enterprise controls.

03 · Train your own 5 min

When calling a model isn't enough.

Sometimes the model you need does not exist yet — a fraud detector on your transactions, a demand forecaster, a classifier on your domain's jargon. For that you want a full ML platform: SageMaker on AWS, Vertex AI on Google. These manage the whole lifecycle, not just inference.

ML platform — a managed environment for the entire model lifecycle: data prep, training at scale, experiment tracking, a model registry, deployed endpoints, and monitoring. A foundation-model API gives you someone else's finished model; an ML platform helps you build, version, and serve your own. The day-to-day discipline lives in MLOps.

The platform manages every stage and the loop back — when live accuracy drifts, you retrain and redeploy.

What SageMaker & Vertex AI give you

Managed training — spin up a GPU cluster for one job, pay for the hours, tear it down automatically.
Registry & versioning — every model is tracked, promotable, and rollback-able.
Serving modes — real-time, serverless, async, and batch endpoints from the same registry.
Feature stores & pipelines — reproducible inputs and automated retraining runs.
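The registry and the retrain loop can be pictured with a toy, in-memory sketch. Everything here (`ModelRegistry`, `needs_retrain`, the 0.05 tolerance) is a hypothetical illustration and not the SageMaker or Vertex AI API; a real platform persists versions and metrics for you.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelVersion:
    version: int
    accuracy: float  # accuracy measured when the version was registered


@dataclass
class ModelRegistry:
    """Toy registry: tracks versions and which one is serving traffic."""

    versions: list[ModelVersion] = field(default_factory=list)
    live: Optional[int] = None  # version number currently serving

    def register(self, accuracy: float) -> ModelVersion:
        mv = ModelVersion(version=len(self.versions) + 1, accuracy=accuracy)
        self.versions.append(mv)
        return mv

    def promote(self, version: int) -> None:
        if not any(v.version == version for v in self.versions):
            raise ValueError(f"unknown version {version}")
        self.live = version

    def rollback(self) -> None:
        """Point live traffic back at the previous version, if there is one."""
        if self.live is not None and self.live > 1:
            self.live -= 1


def needs_retrain(baseline: float, live_accuracy: float, tolerance: float = 0.05) -> bool:
    """Flag drift when live accuracy falls more than `tolerance` below baseline."""
    return (baseline - live_accuracy) > tolerance
```

A monitoring job might call `needs_retrain` on a schedule and, when it returns true, kick off a managed training run, register the result, and promote it once it passes evaluation.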

Just call a foundation model

    # no training, no servers — instant
    resp = client.converse(
        modelId=MODEL,
        messages=msgs,
    )
    # great for language, summary, extraction,
    # classification with a good prompt

Train + serve your own

    # pseudocode: names and arguments are simplified, not exact SDK calls
    # you own data, training, eval, serving
    est = XGBoost(instance="ml.m5", ...)
    est.fit(train_data)       # managed job
    model = est.register()    # versioned
    model.deploy(endpoint)    # real-time
    # worth it for tabular / domain-specific tasks

Calling a model is "simple", not "lesser." Reach for the full platform only when no off-the-shelf model fits your data — for classic tabular problems, a small trained model often beats a giant LLM on cost and latency. To adapt an existing model instead of training from scratch, see Fine-tuning.

04 · One API, many providers 4 min

A gateway in front of every model vendor.

Cloud platforms tie you to one cloud. A model gateway is a thinner, provider-neutral layer: a single endpoint that routes to OpenAI, Anthropic, Google, and open models alike — with failover, spend tracking, and one API key across all of them.

Model gateway — a proxy that exposes one consistent API and forwards each request to whichever provider you name, adding cross-provider routing, automatic failover, caching, and unified billing and observability. It is the Adapter pattern as a service — your code talks to the gateway; the gateway talks to everyone.

One key, one endpoint. The gateway sends to your primary model and fails over to a backup if it errors or rate-limits.

Why teams add one

Resilience — a provider outage or 429 fails over instead of taking you down.
Flexibility — A/B a new model by changing one string; no new SDK, no new key.
One bill, one dashboard — spend, latency, and token usage across every provider in one place.
The catch — you add a hop (a little latency) and trust a middleman with your traffic.
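The resilience point can be sketched as a small failover loop. This is a minimal illustration under stated assumptions: `RateLimited` and `call_with_failover` are hypothetical names, and each provider is modeled as a plain function rather than a real SDK client.

```python
from collections.abc import Callable, Sequence


class RateLimited(Exception):
    """Raised by a provider call to signal an HTTP 429-style rejection."""


def call_with_failover(
    providers: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Try each (name, call) pair in order; return (name, reply) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except (RateLimited, ConnectionError, TimeoutError) as exc:
            # Only transient failures trigger failover; other errors propagate.
            errors.append(f"{name}: {type(exc).__name__}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A hosted gateway does this on its side of the wire, which is why the application sees one endpoint and one key even when the request ends up at a backup provider.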

    import { generateText } from "ai"

    const { text } = await generateText({
      model: "anthropic/claude-...", // creator/model slug
      prompt,
    })
    // change to "openai/gpt-..." → new provider, same code
    // the gateway resolves auth, routing + failover

A creator/model slug is all that changes between providers — the gateway handles the rest.
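What "handles the rest" could look like on the gateway side is sketched below as the Adapter pattern in miniature: each provider gets a small adapter with the same method, and the gateway routes on the `creator/model` slug. The provider classes are fake stand-ins that make no network calls, and all names are hypothetical.

```python
from typing import Protocol


class ChatProvider(Protocol):
    def complete(self, model: str, prompt: str) -> str: ...


class FakeProviderA:
    """Stand-in for one vendor's SDK; returns a canned reply."""

    def complete(self, model: str, prompt: str) -> str:
        return f"A:{model}:{prompt}"


class FakeProviderB:
    """Stand-in for a second vendor with a different backend."""

    def complete(self, model: str, prompt: str) -> str:
        return f"B:{model}:{prompt}"


class Gateway:
    """Routes 'creator/model' slugs to the adapter registered for that creator."""

    def __init__(self, providers: dict[str, ChatProvider]) -> None:
        self._providers = providers

    def generate(self, slug: str, prompt: str) -> str:
        creator, sep, model = slug.partition("/")
        if not sep or not model:
            raise ValueError(f"expected 'creator/model', got {slug!r}")
        if creator not in self._providers:
            raise KeyError(f"no provider registered for {creator!r}")
        return self._providers[creator].complete(model, prompt)
```

Calling `Gateway({"a": FakeProviderA(), "b": FakeProviderB()}).generate("a/small", "hi")` reaches the first adapter; changing the slug to `"b/small"` reaches the second without touching the calling code.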

Vercel AI Gateway

Wired into the AI SDK

Pro — first-class with the Vercel AI SDK; swap models by slug, get spend limits, caching, and observability with almost no glue.
Con — best inside the Vercel / AI SDK ecosystem; less compelling if your stack lives elsewhere.
Choose it when — you build on the AI SDK and want routing and failover for free.

OpenRouter

The broadest catalog

Pro — hundreds of models behind one OpenAI-compatible API and one bill, with price- and latency-aware routing and easy fallbacks.
Con — a third party sits in your request path; vet data handling and the markup on tokens.
Choose it when — you want maximum model choice and provider-neutral routing fast.

Prefer to self-host the gateway? LiteLLM is a popular open-source proxy that speaks the same OpenAI-compatible API across providers — same idea, you run it.

05 · Hosted vs local / open 6 min

Hosted frontier models vs open models you run.

The sharpest trade-off in this whole space: call a proprietary model (Claude, GPT, Gemini) over the network, or run an open-weight model (Llama, Mistral) on hardware you control. Both are legitimate. The honest answer depends on cost shape, data rules, and how much ops you can stomach.

Open-weight model — a model whose trained parameters are published, so anyone can download and run it on their own hardware (Llama, Mistral, Gemma, Qwen). It is not the same as "open source" — the license still sets the rules, and the training data usually is not released. The opposite is a hosted proprietary model you can only reach through the vendor's API.

Hosted: data leaves, you pay per token, you run nothing. Local: data stays, you pay for the box whether it is busy or idle.

The five honest trade-offs

Cost shape — hosted is pay-per-token (cheap to start, scales with use); local is fixed GPU cost (cheaper only at high, steady volume).
Privacy / residency — local keeps every byte inside your network; decisive for regulated or air-gapped data.
Control — local lets you pin versions and fine-tune freely; hosted models can change or deprecate under you.
Quality — frontier hosted models still lead on the hardest tasks, though the gap to open models keeps narrowing.
Ops burden — hosted is near-zero; local means GPUs, serving, scaling, and upgrades are now your job.
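The cost-shape trade-off reduces to simple arithmetic: pay-per-token spend grows with volume while a self-hosted box costs the same busy or idle, so there is a break-even volume. The sketch below uses made-up prices purely for illustration; the function names are hypothetical, and it deliberately ignores staffing, utilization, and redundancy, which usually push the real break-even higher.

```python
def hosted_monthly_cost(tokens_per_month: float, usd_per_million_tokens: float) -> float:
    """Pay-per-token spend for a month of traffic."""
    return tokens_per_month / 1_000_000 * usd_per_million_tokens


def break_even_tokens(fixed_monthly_usd: float, usd_per_million_tokens: float) -> float:
    """Monthly token volume at which a fixed-cost box matches pay-per-token spend."""
    if usd_per_million_tokens <= 0:
        raise ValueError("price per million tokens must be positive")
    return fixed_monthly_usd / usd_per_million_tokens * 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers for illustration only, not real vendor pricing.
    price = 2.0        # USD per million tokens on a hosted API
    gpu_box = 3_000.0  # USD per month for a self-hosted GPU server
    print(f"break-even: {break_even_tokens(gpu_box, price):,.0f} tokens/month")
```

Below the break-even volume the hosted API is cheaper; above it, and only if traffic is steady enough to keep the GPU busy, self-hosting starts to pay off.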

Ollama — local, the easy way

    # one binary, great for a laptop or a dev box
    ollama pull mistral   # weights download once
    ollama run mistral    # chat in the terminal
    # OpenAI-compatible API at :11434/v1
    # simple — but single-box, modest throughput

vLLM — local, for real traffic

    # high-throughput serving on real GPUs
    vllm serve mistralai/Mistral-...   # or a Llama ckpt
    # PagedAttention + continuous batching
    # OpenAI-compatible server at :8000/v1
    # fast at scale — but you size + run the cluster
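Because both servers expose an OpenAI-compatible API, the same client code can talk to either one by changing the base URL. Here is a minimal standard-library sketch; the helper names `build_chat_request` and `send` are hypothetical, and it assumes a server is already running locally with the named model available.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Assumes a local Ollama server with the `mistral` model already pulled.
    # For vLLM, use http://localhost:8000/v1 and the model name it was started with.
    req = build_chat_request("http://localhost:11434/v1", "mistral", "Say hello")
    print(req.full_url)  # call send(req) once the server is up
```

Keeping the request builder separate from the network call makes the payload easy to inspect and test without a running server.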

Ollama

Run a model in minutes

Pro — trivially easy local inference for prototyping, demos, and offline work; one command to a running model.
Con — built around single-box use; not a high-concurrency production server.
Choose it when — developing locally or running light, private workloads.

vLLM

Serve open models at scale

Pro — high throughput and low latency via PagedAttention and continuous batching; the workhorse for self-hosted serving.
Con — you own GPU sizing, autoscaling, and uptime — real MLOps work.
Choose it when — you have steady volume and a reason (cost or data) to keep inference in-house.

Running open models in production is its own discipline — serving, autoscaling, and monitoring are covered in MLOps. For most teams, start hosted and only self-host once cost or compliance forces the move.

06 · Agent platforms 5 min

From one call to a loop that uses tools.

An agent wraps a model in a loop: it reasons, calls tools, reads results, and repeats until the task is done. The clouds now offer managed runtimes so you do not hand-roll the loop, memory, and tool plumbing. The deep dive lives in AI Agents & Tool Use — here we map the platforms.

Agent platform — a managed runtime and toolkit for building agents: it handles the reason-act loop, tool / function calling, memory and session state, and safe execution — so you wire up tools and goals, not the orchestration. Think of it as the platform layer beneath the agent patterns from AI Agents & Tool Use.

The runtime owns the loop — calling tools, threading memory, and re-prompting the model until the goal is met.

What the runtime spares you

The loop — parsing tool calls, executing them, feeding results back, deciding when to stop.
State — multi-turn sessions and memory that survive across steps and requests.
Safe execution — sandboxed tools, timeouts, and guardrails on what the agent can do.
Deploy & scale — a managed place to host the agent, with tracing for when it goes sideways.
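For a sense of what the runtime takes over, the reason-act loop can be sketched in a few lines. This is a toy under stated assumptions: the "model" is a scripted function, the step format (`{"tool": ...}` or `{"final": ...}`) is a hypothetical contract, and there is no sandboxing, persistence, or tracing.

```python
from collections.abc import Callable
from typing import Any

# A "model" here is any function mapping the transcript so far to the next step:
# either {"tool": name, "args": {...}} or {"final": answer}. Hypothetical contract.
Model = Callable[[list[dict[str, Any]]], dict[str, Any]]


def run_agent(
    model: Model,
    tools: dict[str, Callable[..., Any]],
    goal: str,
    max_steps: int = 5,
) -> str:
    """Loop: ask the model, run the requested tool, feed the result back, repeat."""
    transcript: list[dict[str, Any]] = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = model(transcript)
        if "final" in step:
            return step["final"]
        name = step["tool"]
        if name not in tools:  # guardrail: only registered tools may run
            result = f"error: unknown tool {name!r}"
        else:
            result = tools[name](**step.get("args", {}))
        transcript.append({"role": "tool", "name": name, "content": result})
    raise RuntimeError(f"no answer after {max_steps} steps")


def scripted_model(transcript: list[dict[str, Any]]) -> dict[str, Any]:
    """Fake model: asks for one addition, then reports the tool result."""
    tool_results = [m for m in transcript if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"final": f"The sum is {tool_results[-1]['content']}"}


if __name__ == "__main__":
    print(run_agent(scripted_model, {"add": lambda a, b: a + b}, "What is 2 + 3?"))
```

The step cap is the simplest possible stop condition; a managed runtime layers session state, timeouts, and sandboxed execution on top of the same basic loop.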

Google · Vertex AI Agent Builder

Build, deploy, and host agents on Google Cloud.

Vertex AI Agent Builder is Google's managed agent stack. You build with the open-source Agent Development Kit (ADK), then deploy onto a managed runtime (Agent Engine) that handles sessions, scaling, and tracing — agents that lean on Gemini and your tools.

ADK — an open-source framework for defining agents, tools, and multi-agent workflows in code.
Agent Engine — the managed runtime that hosts and scales those agents for you.

A2A · the Agent2Agent protocol

An open standard for agents to talk to each other.

A2A (Agent2Agent) is an open protocol — now under the Linux Foundation — for agents built by different teams or vendors to discover each other and collaborate. It is the interop layer: where MCP standardizes how an agent reaches tools, A2A standardizes how agents reach other agents.

AWS · Bedrock Agents

Orchestrate Bedrock models with your tools and data.

Bedrock Agents connect a foundation model to action groups (your functions, typically AWS Lambda) and Knowledge Bases (managed RAG), so the model can take real steps and ground answers in your data — all inside AWS's IAM and logging. AWS also offers AgentCore, a newer managed runtime for deploying agents (including ones built with other frameworks) with memory, identity, and observability.

When to reach for a platform vs roll your own

The simpler option still wins more often than you think.

For a single agent with a few tools, a plain loop in your own code (or a light framework) is often enough — see AI Agents & Tool Use. Reach for a managed platform when you need durable sessions, multi-agent coordination, sandboxed tool execution, or enterprise-grade tracing and identity. Do not adopt a heavy runtime for a one-shot prompt.

07 · Choosing & recap 4 min

Four levers decide where your AI runs.

Strip away the brand names and almost every choice comes down to four questions: what it costs, where the data may live, how fast it must answer, and how hard it is to leave.

Cost

Spiky or low volume → pay-per-token hosted. High, steady volume → self-host can win on unit cost. Model the real traffic, not the demo.

Data residency

Regulated or air-gapped data → keep inference in your boundary (cloud platform in-region, or local). Check retention and region terms first.

Latency

Add up the hops: gateways and agent loops cost milliseconds. Co-locate the model with the app for tight, interactive paths.

Lock-in

Swapping a model is cheap; un-wiring knowledge bases, guardrails, and agent runtimes is not. A gateway hedges; deep platform features bind.

A rough path, not a law: data rules and volume push you off the hosted default; multi-provider needs add a gateway.

Default to a hosted API on the cloud you already use — it is the cheapest way to learn what you actually need.
Add a gateway the moment you want failover or to compare providers.
Move to a full ML platform only when no off-the-shelf model fits — then MLOps is your home.
Self-host open models when cost at scale or data residency makes the ops worth it — not before.
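The rough path above can be written down as a small decision helper. It is a deliberate simplification, offered as a sketch and not a rule: the function name, the flags, and the order of the checks are assumptions, and real decisions weigh these factors together instead of one at a time.

```python
def recommend(
    *,
    regulated_data: bool,
    steady_high_volume: bool,
    needs_custom_model: bool,
    multi_provider: bool,
) -> list[str]:
    """Return the pieces to adopt, starting from the hosted default."""
    if needs_custom_model:
        base = "ML platform"  # no off-the-shelf model fits
    elif regulated_data:
        base = "in-boundary inference"  # in-region cloud platform or local
    elif steady_high_volume:
        base = "self-hosted open model"  # unit cost dominates
    else:
        base = "hosted API"  # the default

    plan = [base]
    # Assumption: a third-party gateway is only suggested on the hosted path.
    if multi_provider and base == "hosted API":
        plan.append("model gateway")
    return plan


if __name__ == "__main__":
    print(recommend(
        regulated_data=False,
        steady_high_volume=False,
        needs_custom_model=False,
        multi_provider=True,
    ))
```

Writing the path as code makes the ordering explicit: every branch other than the last is a reason to leave the hosted default, and the gateway is an add-on, not a starting point.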

1. Buy the boring parts. Rent serving, scaling, and safety so you can spend effort on the product.

2. Program to the API, not the vendor. One modelId or slug should be all that changes between models.

3. Hosted first, self-host on purpose. Open models earn their ops only at steady scale or under strict data rules.

4. Lock-in hides in the extras. Watch knowledge bases, guardrails, and agent runtimes — that is what is costly to leave.

5. The simplest thing that works wins. A gateway, a platform, an agent runtime — add each only when the pain is real.

