GUIDEDECK · renting AI infrastructure instead of running it

Managed AI Platforms — buy the boring parts.

A 34-minute working session on the services that host models and the plumbing around them — AWS Bedrock, Google Vertex AI, Azure AI Foundry, model gateways, agent runtimes — and the honest call between a hosted API and a model you run yourself.

~34 min · Beginner → Intermediate · Cloud & AI
01 · Build vs buy 4 min

You can rent almost the entire AI stack now.

Five years ago, "use AI" meant buying GPUs, wrangling CUDA, and hiring people to keep a model server alive at 3 a.m. Today most teams call an API and ship. A managed AI platform is the service that makes that possible — and the first real decision is how much of the stack you actually want to own.

Managed AI platform: a cloud service that hosts models plus the plumbing around them (authentication, autoscaling, logging, safety filters, billing) behind a simple API. You rent a capability ("summarize this," "classify that") instead of provisioning the hardware and software that produce it. The app you put on top is covered in Building LLM Apps; the clouds underneath are in Cloud Fundamentals.

What "managed" takes off your plate

  • GPUs & capacity — no fleet to buy, patch, or scale for spiky traffic.
  • Serving & uptime — load balancing, batching, and failover are the vendor's problem.
  • Model access — frontier models you could never train, one API call away.
  • The extras — safety filters, evals, RAG connectors, usage metering, SOC-2 paperwork.
[Diagram: self-managed vs managed stacks. Self-managed: your app + prompts, serving / API, scaling + batching, model weights, GPU fleet, drivers + ops. Managed: your app + prompts on top of the platform (everything below).]

Managed platforms collapse five layers of ops into one bill — you keep only the app and the prompts.

When the simpler option wins — and it often does

  • Most products should just call a hosted API. If you are not GPU-bound and not handling data you legally cannot send out, buying is faster, cheaper to start, and someone else carries the pager.
  • Self-host when the math or the rules demand it — very high, steady volume where per-token cost dominates, strict data residency, air-gapped environments, or deep model customization.
  • The rest of this deck is a map of the middle ground between "one API call" and "run your own GPUs."
02 · Foundation-model platforms 6 min

The big three: Bedrock · Vertex AI · Foundry.

The headline product from each hyperscaler is the same shape: one API in front of a catalog of foundation models, with governance, logging, and billing wired into the cloud you already use. They differ most in which models they front and which cloud they marry you to.

Foundation-model platform: a managed service that exposes many pre-trained models through one unified API, plus shared features like retrieval (RAG), safety guardrails, evaluation, and per-call metering. You do not pick a server; you pick a modelId and send text. The platform routes, scales, and bills.
[Diagram: your app → platform API (auth · log · meter) → Claude, Llama, Mistral, Gemini / Nova, Cohere · Titan]

The platform is a switchboard: your app speaks one API; the model behind modelId can change without a rewrite.

Read the schema

  • One contract. Same request shape whichever model answers — this is exactly the abstraction idea from OOP & Design, applied to vendors.
  • Shared services — RAG, guardrails, evals, and audit logs sit at the platform layer, not in your code.
  • Lock-in lives in the extras. Swapping modelId is easy; un-wiring a knowledge base or guardrail is the cost that keeps you.
    import boto3

    # MODEL is a Bedrock model ID string defined elsewhere
    client = boto3.client("bedrock-runtime")
    out = client.converse(
        modelId=MODEL,  # swap providers — same call
        messages=[{"role": "user", "content": [{"text": "Summarize Q3"}]}],
    )
    print(out["output"]["message"]["content"][0]["text"])
[Diagram: converse() takes modelId, one string that selects Claude, Llama, or Nova]

Bedrock's Converse API normalizes the request — change MODEL and the rest of your code is untouched.
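
To make the "one contract" idea concrete, here is a minimal sketch of a helper built around the Converse request and response shape shown in this section. The `ask` function and the `FakeBedrockClient` stub are hypothetical names for illustration: the stub stands in for `boto3.client("bedrock-runtime")` so the example runs without AWS credentials, and the model IDs are placeholders, not real catalog entries.

```python
from typing import Any


def ask(client: Any, model_id: str, prompt: str) -> str:
    """Send one user message through a Converse-style client and return the reply text."""
    response = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]


class FakeBedrockClient:
    """Hypothetical stand-in that mimics the Converse response shape."""

    def converse(self, modelId: str, messages: list) -> dict:
        user_text = messages[0]["content"][0]["text"]
        reply = f"[{modelId}] echo: {user_text}"
        return {
            "output": {
                "message": {"role": "assistant", "content": [{"text": reply}]}
            }
        }


client = FakeBedrockClient()
# Placeholder model IDs: only this string changes between calls.
for model_id in ("provider-a.model-x", "provider-b.model-y"):
    print(ask(client, model_id, "Summarize Q3"))
```

With a real client you would pass the boto3 object instead of the stub; the calling code stays the same, which is the point of the abstraction.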

The tooling landscape

AWS Bedrock — the model marketplace for AWS shops

Serverless access to models from Anthropic (Claude), Meta, Mistral, Cohere, AI21, Stability, and Amazon's own Nova and Titan — plus Knowledge Bases (RAG), Guardrails, model evaluation, and Bedrock Agents, all behind IAM.

Claude · Llama · Mistral · Nova / Titan · Guardrails
  • Pro — widest third-party catalog (Claude included) with native AWS IAM, VPC, and billing; prompts and outputs stay inside AWS and are not shared with the model providers.
  • Con — model availability varies by region, and the API surface is more verbose than a plain OpenAI-style client.
  • Choose it when — you already run on AWS and want governance and data to stay inside it.

Google Vertex AI — models plus a full ML platform

Google Cloud's unified AI surface. Model Garden fronts Gemini alongside third-party models (Claude, Llama, Mistral), and the same platform also does custom training, pipelines, and the Agent Builder stack — more breadth than a pure model gateway.

Gemini · Claude · Llama · Model Garden · + training
  • Pro — Gemini's long context and multimodality are first-class, and model serving sits next to real ML tooling (covered in Part 3).
  • Con — the surface is large and the console can feel sprawling; you are squarely on Google Cloud.
  • Choose it when — you want Gemini, or you will both call and train models in one place.

Azure AI Foundry — the home of OpenAI on Azure

Microsoft's platform (formerly Azure AI Studio) gives enterprise access to OpenAI's GPT models via Azure OpenAI, plus a model catalog (Llama, Mistral, and more), an agent service, prompt flow, evaluations, and Content Safety — all under Azure identity and compliance.

GPT · Llama · Mistral · Content Safety · Agent Service
  • Pro — the enterprise path to GPT models with Azure's compliance story and tight Microsoft 365 / Entra integration.
  • Con — fewer non-OpenAI frontier options than Bedrock; quota and deployment management add steps.
  • Choose it when — you are a Microsoft shop or you specifically need GPT under enterprise controls.
03 · Train your own 5 min

When calling a model isn't enough.

Sometimes the model you need does not exist yet — a fraud detector on your transactions, a demand forecaster, a classifier on your domain's jargon. For that you want a full ML platform: SageMaker on AWS, Vertex AI on Google. These manage the whole lifecycle, not just inference.

ML platform: a managed environment for the entire model lifecycle: data prep, training at scale, experiment tracking, a model registry, deployed endpoints, and monitoring. A foundation-model API gives you someone else's finished model; an ML platform helps you build, version, and serve your own. The day-to-day discipline lives in MLOps.
[Diagram: data → train → registry → endpoint → monitor + drift, looping back to retrain when it slips]

The platform manages every stage and the loop back — when live accuracy drifts, you retrain and redeploy.
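
The "retrain when it slips" loop can be sketched as a simple check. This is a minimal illustration, not how SageMaker or Vertex AI implement drift monitoring: the function name `needs_retrain`, the accuracy-based signal, and the 0.05 tolerance are all assumptions chosen for the example.

```python
from statistics import mean


def needs_retrain(
    baseline_accuracy: float,
    recent_outcomes: list[bool],
    tolerance: float = 0.05,
) -> bool:
    """Return True when live accuracy falls more than `tolerance` below the baseline.

    `recent_outcomes` holds one bool per labelled live prediction (True = correct).
    """
    if not recent_outcomes:
        return False  # no evidence yet, so do not trigger a retrain
    live_accuracy = mean(1.0 if ok else 0.0 for ok in recent_outcomes)
    return (baseline_accuracy - live_accuracy) > tolerance


# Hypothetical numbers: the model scored 0.92 at deployment time.
healthy_window = [True] * 9 + [False]       # 0.90 live accuracy
degraded_window = [True] * 8 + [False] * 2  # 0.80 live accuracy
print(needs_retrain(0.92, healthy_window))
print(needs_retrain(0.92, degraded_window))
```

Real monitors usually also watch input distributions, since ground-truth labels often arrive late; the accuracy check here is only the simplest version of the idea.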

What SageMaker & Vertex AI give you

  • Managed training — spin up a GPU cluster for one job, pay for the hours, tear it down automatically.
  • Registry & versioning — every model is tracked, promotable, and rollback-able.
  • Serving modes — real-time, serverless, async, and batch endpoints from the same registry.
  • Feature stores & pipelines — reproducible inputs and automated retraining runs.
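
To show what "tracked, promotable, and rollback-able" means in practice, here is a toy in-memory registry. It is a sketch under stated assumptions: `ModelRegistry` and its methods are hypothetical names, the artifact URIs are invented, and a managed registry adds metadata, approvals, and durable storage that this omits.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelRegistry:
    """Toy registry: stores artifact URIs and the order in which versions went live."""

    versions: list[str] = field(default_factory=list)
    history: list[int] = field(default_factory=list)  # promoted versions, oldest first

    def register(self, artifact_uri: str) -> int:
        """Record a new artifact and return its 1-based version number."""
        self.versions.append(artifact_uri)
        return len(self.versions)

    def promote(self, version: int) -> None:
        """Make a registered version the live one."""
        if not 1 <= version <= len(self.versions):
            raise ValueError(f"unknown version {version}")
        self.history.append(version)

    def rollback(self) -> int:
        """Drop the current live version and return the one that was live before it."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier promoted version to roll back to")
        self.history.pop()
        return self.history[-1]

    @property
    def live(self) -> Optional[int]:
        return self.history[-1] if self.history else None


registry = ModelRegistry()
v1 = registry.register("s3://example-bucket/fraud/v1")  # invented URI
v2 = registry.register("s3://example-bucket/fraud/v2")  # invented URI
registry.promote(v1)
registry.promote(v2)
print(registry.live)        # version 2 is live
print(registry.rollback())  # back to version 1
```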
Just call a foundation model

    # no training, no servers — instant
    resp = client.converse(
        modelId=MODEL,
        messages=msgs,
    )
    # great for language, summary, extraction,
    # classification with a good prompt

Train + serve your own

    # pseudocode: illustrative names, not the exact SageMaker or Vertex AI SDK
    # you own data, training, eval, serving
    est = XGBoost(instance="ml.m5", ...)
    est.fit(train_data)       # managed job
    model = est.register()    # versioned
    model.deploy(endpoint)    # real-time
    # worth it for tabular / domain-specific tasks

Calling a model is "simple", not "lesser." Reach for the full platform only when no off-the-shelf model fits your data — for classic tabular problems, a small trained model often beats a giant LLM on cost and latency. To adapt an existing model instead of training from scratch, see Fine-tuning.

04 · One API, many providers 4 min

A gateway in front of every model vendor.

Cloud platforms tie you to one cloud. A model gateway is a thinner, provider-neutral layer: a single endpoint that routes to OpenAI, Anthropic, Google, and open models alike — with failover, spend tracking, and one API key across all of them.

Model gateway: a proxy that exposes one consistent API and forwards each request to whichever provider you name, adding cross-provider routing, automatic failover, caching, and unified billing and observability. It is the Adapter pattern as a service: your code talks to the gateway; the gateway talks to everyone.
[Diagram: your app → gateway (route · retry · meter) → OpenAI (primary), Anthropic, Google · open (fallback)]

One key, one endpoint. The gateway sends to your primary model and fails over to a backup if it errors or rate-limits.

Why teams add one

  • Resilience — a provider outage or 429 fails over instead of taking you down.
  • Flexibility — A/B a new model by changing one string; no new SDK, no new key.
  • One bill, one dashboard — spend, latency, and token usage across every provider in one place.
  • The catch — you add a hop (a little latency) and trust a middleman with your traffic.
    import { generateText } from "ai"

    const { text } = await generateText({
      model: "anthropic/claude-...", // creator/model slug
      prompt,
    })
    // change "openai/gpt-..." → new provider, same code
    // the gateway resolves auth, routing + failover

[Diagram: one "creator/model" string selects anthropic/..., openai/..., or google/...]

A creator/model slug is all that changes between providers — the gateway handles the rest.
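
What a gateway does behind that slug can be sketched in a few lines: split the slug, call the named provider, and fail over to the next one on a rate limit or server error. This is a minimal illustration, not the implementation of any real gateway. `ProviderError`, `split_slug`, `route`, the retryable status set, and the `alpha` / `beta` providers are all hypothetical.

```python
from typing import Callable


class ProviderError(Exception):
    """Raised by a provider call; `status` mimics an HTTP status code."""

    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status


RETRYABLE = {429, 500, 502, 503, 504}  # assumption: statuses worth failing over on


def split_slug(slug: str) -> tuple[str, str]:
    """Split 'creator/model' into its two parts."""
    creator, _, model = slug.partition("/")
    if not creator or not model:
        raise ValueError(f"expected 'creator/model', got {slug!r}")
    return creator, model


def route(
    slugs: list[str],
    prompt: str,
    providers: dict[str, Callable[[str, str], str]],
) -> tuple[str, str]:
    """Try each slug in order; return (slug_used, reply) from the first that succeeds."""
    last_error = None
    for slug in slugs:
        creator, model = split_slug(slug)
        try:
            return slug, providers[creator](model, prompt)
        except ProviderError as err:
            if err.status not in RETRYABLE:
                raise  # e.g. a 400 is our bug; another provider will not fix it
            last_error = err
    raise RuntimeError("all providers failed") from last_error


def rate_limited(model: str, prompt: str) -> str:
    raise ProviderError(429)  # simulate a throttled primary


def healthy(model: str, prompt: str) -> str:
    return f"{model} says: ok"


providers = {"alpha": rate_limited, "beta": healthy}
print(route(["alpha/model-a", "beta/model-b"], "Summarize Q3", providers))
```

Failing over only on retryable statuses is a deliberate choice in this sketch: a malformed request would fail identically everywhere, so retrying it just adds latency and cost.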

Vercel AI Gateway

Wired into the AI SDK

  • Pro — first-class with the Vercel AI SDK; swap models by slug, get spend limits, caching, and observability with almost no glue.
  • Con — best inside the Vercel / AI SDK ecosystem; less compelling if your stack lives elsewhere.
  • Choose it when — you build on the AI SDK and want routing and failover for free.
OpenRouter

The broadest catalog

  • Pro — hundreds of models behind one OpenAI-compatible API and one bill, with price- and latency-aware routing and easy fallbacks.
  • Con — a third party sits in your request path; vet data handling and the markup on tokens.
  • Choose it when — you want maximum model choice and provider-neutral routing fast.

Prefer to self-host the gateway? LiteLLM is a popular open-source proxy that speaks the same OpenAI-compatible API across providers — same idea, you run it.

05 · Hosted vs local / open 6 min

Hosted frontier models vs open models you run.

The sharpest trade-off in this whole space: call a proprietary model (Claude, GPT, Gemini) over the network, or run an open-weight model (Llama, Mistral) on hardware you control. Both are legitimate. The honest answer depends on cost shape, data rules, and how much ops you can stomach.

Open-weight model: a model whose trained parameters are published, so anyone can download and run it on their own hardware (Llama, Mistral, Gemma, Qwen). It is not the same as "open source": the license still sets the rules, and the training data usually is not released. The opposite is a hosted proprietary model, which you can only reach through the vendor's API.
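[Diagram: hosted vs local / open. Hosted: the app inside your boundary calls the vendor cloud, $ per token, data leaves your boundary. Local / open: the app and your GPU both sit inside your boundary, fixed $ / hour, data stays in.]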

Hosted: data leaves, you pay per token, you run nothing. Local: data stays, you pay for the box whether it is busy or idle.

The five honest trade-offs

  • Cost shape — hosted is pay-per-token (cheap to start, scales with use); local is fixed GPU cost (cheaper only at high, steady volume).
  • Privacy / residency — local keeps every byte inside your network; decisive for regulated or air-gapped data.
  • Control — local lets you pin versions and fine-tune freely; hosted models can change or deprecate under you.
  • Quality — frontier hosted models still lead the hardest tasks, though the open gap keeps narrowing.
  • Ops burden — hosted is near-zero; local means GPUs, serving, scaling, and upgrades are now your job.
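
The cost-shape trade-off is easiest to see as a break-even calculation. The sketch below is illustrative only: the token price and GPU rate are made-up numbers, not vendor quotes, and the function names are hypothetical. It also assumes one always-on GPU can serve the whole load and ignores staffing, which a real comparison should not.

```python
def monthly_hosted_cost(tokens_per_month: float, price_per_million_tokens: float) -> float:
    """Pay-per-token: cost scales linearly with usage."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens


def monthly_local_cost(gpu_hourly_rate: float, hours_per_month: float = 730.0) -> float:
    """Fixed cost: the box bills the same whether it is busy or idle."""
    return gpu_hourly_rate * hours_per_month


def break_even_tokens(
    price_per_million_tokens: float,
    gpu_hourly_rate: float,
    hours_per_month: float = 730.0,
) -> float:
    """Monthly token volume at which hosted and local cost the same."""
    fixed = monthly_local_cost(gpu_hourly_rate, hours_per_month)
    return fixed / price_per_million_tokens * 1_000_000


PRICE_PER_MILLION = 10.0  # made-up hosted price in dollars
GPU_HOURLY = 2.0          # made-up GPU rental rate in dollars

threshold = break_even_tokens(PRICE_PER_MILLION, GPU_HOURLY)
print(f"break-even at about {threshold / 1e6:.0f}M tokens per month")
for tokens in (20e6, 500e6):
    hosted = monthly_hosted_cost(tokens, PRICE_PER_MILLION)
    local = monthly_local_cost(GPU_HOURLY)
    cheaper = "hosted" if hosted < local else "local"
    print(f"{tokens / 1e6:.0f}M tokens: hosted ${hosted:,.0f} vs local ${local:,.0f} -> {cheaper}")
```

Below the threshold the idle GPU is wasted money; above it, every extra token is cheaper on hardware you already pay for. That is why "high, steady volume" is the condition that matters.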
Ollama — local, the easy way
    # one binary, great for a laptop or a dev box
    ollama pull mistral    # weights download once
    ollama run mistral     # chat in the terminal
    # OpenAI-compatible API at :11434/v1
    # simple — but single-box, modest throughput
vLLM — local, for real traffic
    # high-throughput serving on real GPUs
    vllm serve mistralai/Mistral-...    # or a Llama ckpt
    # PagedAttention + continuous batching
    # OpenAI-compatible server at :8000/v1
    # fast at scale — but you size + run the cluster
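
Because both servers expose an OpenAI-compatible API, the same client code can target either one by changing the base URL. The sketch below only builds the HTTP request with the standard library and does not send it, so it runs without a server. `build_chat_request` is a hypothetical helper, and `my-served-model` is a placeholder for whatever model name your vLLM server was started with.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST to the OpenAI-style /chat/completions route under `base_url`."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Same helper, two local servers: only the base URL and model name differ.
ollama_req = build_chat_request("http://localhost:11434/v1", "mistral", "Summarize Q3")
vllm_req = build_chat_request("http://localhost:8000/v1", "my-served-model", "Summarize Q3")

print(ollama_req.full_url)
print(vllm_req.full_url)
# To send for real (with the server running): urllib.request.urlopen(ollama_req)
```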
Ollama

Run a model in minutes

  • Pro — trivially easy local inference for prototyping, demos, and offline work; one command to a running model.
  • Con — built around single-box use; not a high-concurrency production server.
  • Choose it when — developing locally or running light, private workloads.
vLLM

Serve open models at scale

  • Pro — high throughput and low latency via PagedAttention and continuous batching; the workhorse for self-hosted serving.
  • Con — you own GPU sizing, autoscaling, and uptime — real MLOps work.
  • Choose it when — you have steady volume and a reason (cost or data) to keep inference in-house.

Running open models in production is its own discipline — serving, autoscaling, and monitoring are covered in MLOps. For most teams, start hosted and only self-host once cost or compliance forces the move.

06 · Agent platforms 5 min

From one call to a loop that uses tools.

An agent wraps a model in a loop: it reasons, calls tools, reads results, and repeats until the task is done. The clouds now offer managed runtimes so you do not hand-roll the loop, memory, and tool plumbing. The deep dive lives in AI Agents & Tool Use — here we map the platforms.

Agent platform: a managed runtime and toolkit for building agents. It handles the reason-act loop, tool / function calling, memory and session state, and safe execution, so you wire up tools and goals, not the orchestration. Think of it as the platform layer beneath the agent patterns from AI Agents & Tool Use.
[Diagram: the runtime runs the loop (act → observe → repeat) around the model (reason), tools / APIs (search · db · code), and memory (session state)]

The runtime owns the loop — calling tools, threading memory, and re-prompting the model until the goal is met.

What the runtime spares you

  • The loop — parsing tool calls, executing them, feeding results back, deciding when to stop.
  • State — multi-turn sessions and memory that survive across steps and requests.
  • Safe execution — sandboxed tools, timeouts, and guardrails on what the agent can do.
  • Deploy & scale — a managed place to host the agent, with tracing for when it goes sideways.
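
The loop a runtime takes off your hands looks roughly like the sketch below. It is a hand-rolled illustration, not the behavior of Agent Engine, Bedrock Agents, or any other product. The action dictionary format, `run_agent`, and the scripted stand-in for the model are all assumptions made so the example runs without an LLM.

```python
from typing import Callable

Transcript = list[tuple[str, str]]  # (role, text) pairs


def run_agent(
    model: Callable[[Transcript], dict],
    tools: dict[str, Callable[..., object]],
    goal: str,
    max_steps: int = 5,
) -> str:
    """Reason-act loop: ask the model, run the tool it names, feed the result back."""
    transcript: Transcript = [("user", goal)]
    for _ in range(max_steps):
        action = model(transcript)            # reason
        if action["type"] == "final":
            return action["text"]             # the model decided it is done
        tool = tools[action["tool"]]
        observation = tool(**action["args"])  # act
        transcript.append(("tool", f"{action['tool']} -> {observation}"))  # observe
    raise RuntimeError("step budget exhausted before the agent finished")


def scripted_model(transcript: Transcript) -> dict:
    """Stand-in for an LLM: call `add` once, then answer with its result."""
    tool_results = [text for role, text in transcript if role == "tool"]
    if not tool_results:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "final", "text": f"Done: {tool_results[-1]}"}


tools = {"add": lambda a, b: a + b}
print(run_agent(scripted_model, tools, "What is 2 + 3?"))
```

The `max_steps` budget is the simplest form of the "deciding when to stop" guard: without it, a model that keeps requesting tools would loop forever.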
Google · Vertex AI Agent Builder
Build, deploy, and host agents on Google Cloud.

Vertex AI Agent Builder is Google's managed agent stack. You build with the open-source Agent Development Kit (ADK), then deploy onto a managed runtime (Agent Engine) that handles sessions, scaling, and tracing for agents that lean on Gemini and your tools.

  • ADK — an open-source framework for defining agents, tools, and multi-agent workflows in code.
  • Agent Engine — the managed runtime that hosts and scales those agents for you.
A2A · the Agent2Agent protocol
An open standard for agents to talk to each other.

A2A (Agent2Agent) is an open protocol — now under the Linux Foundation — for agents built by different teams or vendors to discover each other and collaborate. It is the interop layer: where MCP standardizes how an agent reaches tools, A2A standardizes how agents reach other agents.

AWS · Bedrock Agents
Orchestrate Bedrock models with your tools and data.

Bedrock Agents connect a foundation model to action groups (your functions, typically AWS Lambda) and Knowledge Bases (managed RAG), so the model can take real steps and ground answers in your data, all inside AWS's IAM and logging. AWS also offers AgentCore, a newer managed runtime for deploying agents (including ones built with other frameworks) with memory, identity, and observability.

When to reach for a platform vs roll your own
The simpler option still wins more often than you think.

For a single agent with a few tools, a plain loop in your own code (or a light framework) is often enough — see AI Agents & Tool Use. Reach for a managed platform when you need durable sessions, multi-agent coordination, sandboxed tool execution, or enterprise-grade tracing and identity. Do not adopt a heavy runtime for a one-shot prompt.

07 · Choosing & recap 4 min

Four levers decide where your AI runs.

Strip away the brand names and almost every choice comes down to four questions: what it costs, where the data may live, how fast it must answer, and how hard it is to leave.

Cost

Spiky or low volume → pay-per-token hosted. High, steady volume → self-host can win on unit cost. Model the real traffic, not the demo.

Data residency

Regulated or air-gapped data → keep inference in your boundary (cloud platform in-region, or local). Check retention and region terms first.

Latency

Add up the hops: gateways and agent loops cost milliseconds. Co-locate the model with the app for tight, interactive paths.

Lock-in

Swapping a model is cheap; un-wiring knowledge bases, guardrails, and agent runtimes is not. A gateway hedges; deep platform features bind.

need a model?
  • sensitive data? → in-region / local
  • multi-provider? → gateway
  • high steady volume? → self-host (vLLM)
  • else, default → hosted API

A rough path, not a law: data rules and volume push you off the hosted default; multi-provider needs add a gateway.
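
The same rough path can be written as a small function. This is one reading of the decision path, not a rule: `choose_deployment` is a hypothetical name, and returning every option that applies (rather than only the first match) is an interpretation of "data rules and volume push you off the hosted default; multi-provider needs add a gateway."

```python
def choose_deployment(
    sensitive_data: bool,
    multi_provider: bool,
    high_steady_volume: bool,
) -> list[str]:
    """Return the options the rough path points to, defaulting to a hosted API."""
    choices = []
    if sensitive_data:
        choices.append("in-region / local")
    if multi_provider:
        choices.append("gateway")
    if high_steady_volume:
        choices.append("self-host (vLLM)")
    return choices or ["hosted API"]


print(choose_deployment(False, False, False))  # nothing pushes you off the default
print(choose_deployment(False, True, False))   # only the multi-provider need applies
print(choose_deployment(True, False, True))    # data rules and volume both apply
```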

  • Default to a hosted API on the cloud you already use — it is the cheapest way to learn what you actually need.
  • Add a gateway the moment you want failover or to compare providers.
  • Move to a full ML platform only when no off-the-shelf model fits — then MLOps is your home.
  • Self-host open models when cost at scale or data residency makes the ops worth it — not before.
1. Buy the boring parts. Rent serving, scaling, and safety so you can spend effort on the product.
2. Program to the API, not the vendor. One modelId or slug should be all that changes between models.
3. Hosted first, self-host on purpose. Open models earn their ops only at steady scale or under strict data rules.
4. Lock-in hides in the extras. Watch knowledge bases, guardrails, and agent runtimes: that is what is costly to leave.
5. The simplest thing that works wins. A gateway, a platform, an agent runtime: add each only when the pain is real.