Chapter 7 — Infrastructure: AI Is Just Cloud at Scale

Opening Scenario

A healthcare technology company built a machine learning platform to power diagnostic assistance tools across their product line. They did everything by the book—or so they thought. Models were trained in a dedicated environment, inference ran on GPU clusters in their cloud provider's managed Kubernetes service, and data scientists accessed the platform through a VPN-protected portal.

The platform worked well for eighteen months. Then a routine cloud security audit flagged something unexpected: the GPU nodes running inference workloads were communicating with IP addresses outside the company's known network ranges. Investigation revealed that a container image used in the inference pipeline—pulled from a public registry six months earlier—had been compromised. The image contained a cryptocurrency miner that activated only when GPU resources exceeded a threshold, hiding its activity during testing but running constantly in production.

The miner was the visible symptom. The deeper problem was worse. The compromised container had access to the same network segment as the model serving infrastructure. It could observe API traffic, including the diagnostic queries flowing through the system. For months, patient symptom descriptions and preliminary diagnostic suggestions had been exposed to an attacker who had never touched the AI system directly—they had simply poisoned a dependency in the container supply chain.

The security team discovered that the inference nodes had been deployed with default service account permissions, granting them broad read access across the cluster. The GPU nodes had no network segmentation from the rest of the platform. The container registry had no signature verification. The monitoring system had flagged the anomalous network traffic weeks earlier, but the alert was categorized as "infrastructure noise" and deprioritized.

Every failure was an infrastructure failure. The AI was incidental.

This chapter is about infrastructure—and why organizations building AI systems keep relearning lessons that cloud security solved a decade ago.


Why This Area Matters

The conventional view treats AI infrastructure as a specialized domain requiring specialized knowledge. GPU clusters, model registries, vector databases, inference endpoints—these sound like AI problems requiring AI solutions. Vendors reinforce this by selling "AI security platforms" that promise to secure your machine learning infrastructure with purpose-built tools.

This framing is wrong, and dangerously so.

The real problem is that AI infrastructure is cloud infrastructure with different resource profiles. The security fundamentals don't change because you're running matrix multiplications instead of web servers. Identity, network segmentation, secrets management, supply chain integrity, observability—these are the same problems every cloud-native organization faces. They just manifest differently when compute is measured in GPU-hours and data flows through embedding pipelines instead of REST APIs.

What actually happens is this: organizations treat AI infrastructure as greenfield, ignoring the mature practices they've developed for everything else. Data science teams spin up GPU clusters with default configurations. MLOps pipelines pull dependencies without verification. Model artifacts live in object storage buckets with overly permissive access policies. The same organization that would never deploy a web application without network segmentation deploys inference endpoints with flat network access to production databases.

This matters because infrastructure failures don't look like AI failures. The healthcare company didn't have a "machine learning security incident"—they had a container supply chain compromise, a network segmentation failure, and an identity misconfiguration. The fact that GPUs were involved was irrelevant to the attack path. But because the system was labeled "AI," the security team treated it as outside their domain.

The architectural question is not "how do we secure AI infrastructure?" It's "why aren't we applying the cloud security practices we already know to systems that happen to run AI workloads?"


Architectural Breakdown

The Infrastructure Stack, Demystified

AI systems run on infrastructure that looks exotic but isn't. Strip away the terminology, and you find familiar components with familiar security requirements:

Compute layer: GPU nodes, TPUs, or CPU clusters running training and inference workloads. These are virtual machines or containers with accelerator hardware attached. They need the same controls as any compute: hardened images, minimal privileges, network isolation, patch management.

Storage layer: Object storage for datasets and model artifacts, block storage for working data, databases for metadata and feature stores. The security model is identical to any other storage: access controls, encryption, versioning, retention policies.

Network layer: VPCs, subnets, load balancers, API gateways, service meshes. Traffic flows between components need the same segmentation, inspection, and monitoring as any distributed system.

Identity layer: Service accounts for automated workloads, user identities for data scientists and operators, machine identities for inter-service communication. The requirements are standard: least privilege, credential rotation, federation where appropriate.

Orchestration layer: Kubernetes, managed ML platforms, workflow engines, CI/CD pipelines. These are control planes that manage other infrastructure and require the same hardening as any control plane.

The components map directly:

[Traditional Cloud]              [AI Infrastructure]
─────────────────────────────────────────────────────
Web servers                  →   Inference endpoints
Application databases        →   Feature stores, vector DBs
CI/CD pipelines              →   Training pipelines
Container registries         →   Model registries
API gateways                 →   Model serving gateways
Monitoring/logging           →   Experiment tracking, model monitoring

The security controls map just as directly. If you know how to secure a Kubernetes cluster, you know how to secure a Kubernetes cluster running inference workloads. The GPU doesn't change the threat model.

Identity: The Foundation That Gets Ignored

Identity is where AI infrastructure most commonly fails, and where the failure has the most cascading consequences.

The pattern is consistent: data scientists need access to train models. Training jobs need access to data. Inference services need access to models. The path of least resistance is broad permissions—and broad permissions become the architecture.

Consider a typical training pipeline:

[Data Scientist] → (user identity) → [Notebook Environment]
                                            ↓
                                    [Training Job]
                                            ↓
                                    (service account)
                                            ↓
                    [Data Lake] ← → [GPU Cluster] ← → [Model Registry]

The architectural questions multiply:

  • Does the training job run with the data scientist's identity or a service account?
  • If a service account, is it scoped to this job or shared across all training jobs?
  • Can the job access all data in the lake or only training data?
  • Can it write to any location in the model registry or only designated paths?
  • Does the GPU cluster have network access to anything beyond what's required?

Most organizations answer these questions with "whatever works." The training job inherits a powerful service account because debugging permission errors is slower than granting access. The GPU cluster has broad network access because tracking down connectivity issues delays model development. The model registry is world-readable because access control setup is deferred as a "future improvement."

Each of these shortcuts is familiar to anyone who's secured cloud infrastructure. The solutions are equally familiar:

Workload identity: Training jobs should have identities tied to the specific workload, not inherited from users or shared across jobs. Cloud providers offer workload identity federation for exactly this purpose. If your training job runs with a shared service account, you can't attribute actions to specific jobs and can't scope permissions appropriately.

Just-in-time access: Data scientists don't need permanent access to production data. They need access when they're actively training models, revoked when they're not. Implement time-bounded access grants rather than standing permissions.

Least privilege by default: Service accounts should start with no permissions and have capabilities added as required. This inverts the common pattern where accounts start with broad access that's theoretically tightened later but never actually is.

Identity for models: Models themselves should have identity. When a model is deployed for inference, the inference endpoint should authenticate as that specific model version, not as a generic "inference service." This enables per-model access controls, per-model audit trails, and per-model revocation.
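
These controls are easy to state and easy to defer. A minimal sketch of the just-in-time pattern, using only the Python standard library: the GrantStore, principal names, and resource paths are hypothetical stand-ins for whatever actually issues credentials in your environment (cloud IAM, a secrets broker). The point is the shape of the control, time-bounded and default-deny grants rather than standing permissions, not this particular implementation.

# Sketch: time-bounded access grants instead of standing permissions.
# GrantStore is a stand-in for whatever actually issues credentials
# (cloud IAM, a secrets broker); names and paths are hypothetical.

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass(frozen=True)
class Grant:
    principal: str        # human or workload identity
    resource: str         # e.g. a dataset or model-registry path
    action: str           # e.g. "read", "write"
    expires_at: datetime

class GrantStore:
    def __init__(self):
        self._grants: list[Grant] = []

    def issue(self, principal: str, resource: str, action: str,
              ttl: timedelta = timedelta(hours=4)) -> Grant:
        """Issue a grant that expires automatically; nothing is standing."""
        grant = Grant(principal, resource, action,
                      datetime.now(timezone.utc) + ttl)
        self._grants.append(grant)
        return grant

    def is_allowed(self, principal: str, resource: str, action: str) -> bool:
        """Default deny: only an unexpired, exactly matching grant allows access."""
        now = datetime.now(timezone.utc)
        return any(g.principal == principal and g.resource == resource
                   and g.action == action and g.expires_at > now
                   for g in self._grants)

# Example: four hours of read access to one dataset, nothing more.
store = GrantStore()
store.issue("alice@example.com", "datalake/claims-2024", "read")
assert store.is_allowed("alice@example.com", "datalake/claims-2024", "read")
assert not store.is_allowed("alice@example.com", "datalake/claims-2024", "write")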

The identity failures in AI infrastructure aren't novel—they're the same failures organizations made with cloud infrastructure in 2015. The difference is that AI systems often handle more sensitive data and make more consequential decisions than the web applications that taught us these lessons.

Network Segmentation: Flat Networks, Catastrophic Failures

The healthcare scenario illustrated a network segmentation failure: inference nodes could communicate with arbitrary external addresses, and internal network traffic wasn't isolated by workload type.

This is endemic in AI infrastructure. The reason is architectural: AI systems have complex data flows, and complex data flows are annoying to segment. Training jobs need to pull data from multiple sources, push models to registries, log metrics to tracking systems, and communicate with orchestration services. Rather than map these flows and create appropriate network policies, teams put everything in the same network segment.

The blast radius of this decision is severe. A compromised component—whether through supply chain attack, vulnerability exploitation, or misconfiguration—gains visibility into all traffic in the segment. For AI systems, that traffic includes:

  • Training data flowing to compute nodes
  • Inference requests containing user data
  • Model artifacts being loaded and served
  • Prompts and completions for generative models
  • Embeddings that can be reversed to reveal source data

The segmentation model for AI infrastructure should mirror the logical separation of concerns:

Training plane: Isolated network segment for training workloads. Ingress from data sources, egress to model storage, no direct access to inference or production systems.

Inference plane: Isolated segment for serving models. Ingress from API gateways only, egress to logging and monitoring, no access to training data or raw datasets.

Data plane: Isolated segment for data storage and processing. Access controlled at the dataset level, egress only to authorized compute.

Control plane: Isolated segment for orchestration, scheduling, and management. Highly restricted access, separate from data and compute paths.

                    [External]
                        ↓
                  [API Gateway]
                        ↓
               ┌─────────────────┐
               │ Inference Plane │
               │ (Model Serving) │
               └────────┬────────┘
                        ↓ (read-only)
               ┌─────────────────┐
               │  Model Registry │
               └────────┬────────┘
                        ↑ (write)
               ┌─────────────────┐
               │  Training Plane │
               │  (GPU Clusters) │
               └────────┬────────┘
                        ↓ (read-only)
               ┌─────────────────┐
               │   Data Plane    │
               │  (Storage/Lake) │
               └─────────────────┘

Traffic between planes should flow through defined interfaces with inspection and logging. Direct pod-to-pod or VM-to-VM communication across planes should be impossible by default.
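
One way to make "impossible by default" concrete is to write the allowed plane-to-plane flows down as data and deny everything else. The sketch below is illustrative only: in practice the same matrix is enforced by network controls (Kubernetes NetworkPolicies, security groups, service-mesh authorization), not application code, and the plane names simply follow the diagram above.

# Sketch: default-deny flow matrix between infrastructure planes.
# In production this policy lives in network controls (NetworkPolicy,
# security groups, mesh authorization), not in application code.

ALLOWED_FLOWS = {
    ("api-gateway", "inference"),      # external traffic enters via the gateway
    ("inference", "model-registry"),   # read-only model loading
    ("training", "model-registry"),    # writes of new model versions
    ("training", "data-plane"),        # read-only access to training data
    ("inference", "observability"),    # logs and metrics egress
    ("training", "observability"),
}

def flow_allowed(src_plane: str, dst_plane: str) -> bool:
    """Anything not explicitly allowed is denied."""
    return (src_plane, dst_plane) in ALLOWED_FLOWS

# A compromised inference pod trying to reach raw training data is denied.
assert flow_allowed("inference", "model-registry")
assert not flow_allowed("inference", "data-plane")
assert not flow_allowed("inference", "training")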

This isn't AI-specific architecture. It's the same segmentation model you'd apply to any multi-tier application. The failure to apply it to AI systems isn't a knowledge gap—it's a discipline gap.

Supply Chain: The Forgotten Dependency Graph

AI systems have deep dependency graphs that organizations rarely map, let alone secure. The healthcare incident involved a compromised container image, but the supply chain exposure extends much further:

Base images and OS dependencies: Training and inference containers are built on base images with operating system packages. Those packages have vulnerabilities. Those vulnerabilities are exploitable.

ML framework dependencies: PyTorch, TensorFlow, JAX, and their ecosystems depend on hundreds of packages. A compromised package in that chain affects every model trained or served with that framework.

Model weights and pre-trained components: Organizations increasingly start from pre-trained models or embed third-party models as components. Those model weights came from somewhere. Do you know where? Do you trust that source?

Data dependencies: Training data often includes third-party datasets, synthetic data from external services, or data transformations performed by external tools. Each is a supply chain link.

Inference-time dependencies: Some models call external services during inference—retrieval systems, tool endpoints, validation services. Those services are dependencies with their own supply chains.

The supply chain security model for AI infrastructure should address each layer:

Image provenance: Build images from known-good base images with verified signatures. Scan images for vulnerabilities before deployment. Don't pull from public registries without verification.

Dependency pinning and scanning: Pin framework versions and dependencies. Scan for known vulnerabilities. Update deliberately, not automatically.

Model provenance: Document where models come from. Verify signatures where available. Treat downloaded model weights like downloaded executables—with appropriate suspicion.

Data provenance: Track data sources and transformations. Verify data integrity. Don't train on data you can't trace.

Runtime dependency monitoring: Inventory external services your AI systems call. Monitor those dependencies for availability and security posture.
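
For model provenance in particular, "treat downloaded weights like downloaded executables" can start as simply as refusing to load any artifact whose checksum is missing from, or doesn't match, a manifest you control. A minimal sketch, assuming a hypothetical JSON manifest of expected SHA-256 digests recorded when the model was produced:

# Sketch: verify downloaded model weights against a trusted digest manifest
# before they are ever loaded. The manifest format is an assumption:
# {"models/resnet50-v3.safetensors": "<sha256 hex digest>", ...}

import hashlib
import json
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(artifact: Path, manifest_path: Path) -> None:
    """Raise rather than silently load an unlisted or tampered artifact."""
    manifest = json.loads(manifest_path.read_text())
    expected = manifest.get(str(artifact))
    if expected is None:
        raise RuntimeError(f"{artifact} is not in the provenance manifest")
    actual = sha256_of(artifact)
    if actual != expected:
        raise RuntimeError(f"digest mismatch for {artifact}: "
                           f"expected {expected}, got {actual}")

# Example: refuse to serve weights that were not recorded at build time.
# verify_artifact(Path("models/resnet50-v3.safetensors"), Path("manifest.json"))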

The tooling for supply chain security exists. Software bill of materials (SBOM) generation, container scanning, dependency vulnerability databases—these are mature practices. The failure in AI infrastructure is applying them inconsistently or not at all.
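
Even before full SBOM tooling is in place, a lightweight pre-deployment check catches one of the most common gaps in ML environments: unpinned framework dependencies. A minimal sketch, assuming dependencies are declared in a standard requirements.txt; the file path and failure behavior are yours to choose.

# Sketch: flag requirements.txt entries that are not pinned to an exact
# version, so "update deliberately, not automatically" becomes enforceable.

import re
from pathlib import Path

PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[^=]+$")

def unpinned_requirements(path: Path) -> list[str]:
    offenders = []
    for line in path.read_text().splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if not line or line.startswith("-"):   # skip blanks and pip options
            continue
        if not PINNED.match(line):
            offenders.append(line)
    return offenders

# Example: fail the build if anything floats.
# bad = unpinned_requirements(Path("requirements.txt"))
# if bad:
#     raise SystemExit(f"unpinned dependencies: {bad}")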

Shared Responsibility in Hyperscaler Environments

Most AI infrastructure runs on hyperscaler platforms—AWS, Azure, Google Cloud—either directly or through managed ML services. The shared responsibility model matters here, and it's frequently misunderstood.

The hyperscaler is responsible for:

  • Physical security of data centers
  • Hardware security of compute and storage
  • Hypervisor and host OS security for managed services
  • Network infrastructure security within their backbone
  • Security of the managed service control planes

You are responsible for:

  • Configuration of managed services
  • Identity and access management
  • Network configuration within your environment
  • Data protection and encryption key management
  • Application and workload security
  • Vulnerability management for your code and dependencies

The boundary shifts depending on the service model:

IaaS (GPU VMs): You own almost everything above the hypervisor. OS patching, network configuration, identity management, workload security—all yours.

Managed Kubernetes (GKE, EKS, AKS): The provider manages the control plane. You manage the workloads, network policies, and node security.

Managed ML platforms (SageMaker, Vertex AI, Azure ML): The provider manages more of the stack. You still own data access controls, model security, and endpoint configuration.

Fully managed inference (Bedrock, Model API services): The provider manages the infrastructure entirely. You own prompt security, output handling, and integration security.

The failure mode is assuming the managed service handles security concerns it doesn't. A managed ML platform doesn't know what data is sensitive. A managed inference endpoint doesn't validate that your prompts are safe. A managed GPU cluster doesn't segment your workloads from each other.

Every managed service requires examining the shared responsibility model and asking: what am I still responsible for? The answer is always more than people expect.

Observability: You Can't Secure What You Can't See

Infrastructure observability for AI systems fails in predictable ways:

Metrics without context: Teams collect GPU utilization, memory consumption, inference latency. These are operational metrics. They don't tell you whether the system is being abused.

Logs without correlation: Training logs go one place, inference logs another, infrastructure logs a third. Correlating an incident across these sources requires manual effort that doesn't happen during a real incident.

Alerts tuned for availability, not security: Alerts fire when latency exceeds thresholds or when nodes fail. They don't fire when access patterns change, when data exfiltration is attempted, or when inference requests probe for vulnerabilities.

Missing audit trails: Who accessed what model, when, from where, with what permissions? In most AI infrastructure, this question is difficult or impossible to answer.

The observability model for AI infrastructure should include:

Identity-aware logging: Every action should be attributable to an identity—human or service. Logs should capture not just what happened but who did it.

Data flow tracing: Trace data from ingestion through training to inference. Know where your data goes and what touches it.

Security-relevant metrics: Track access patterns, permission usage, network flows, and authentication events—not just performance metrics.

Correlated alerting: Build alerts that span infrastructure layers. A single failed login isn't interesting. A burst of failed logins followed by a successful login from a new IP followed by bulk data access is extremely interesting.

Immutable audit logs: Write audit logs to append-only storage outside the control of the system being audited. If an attacker can delete their traces, they will.
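
A minimal sketch of the correlated-alerting idea described above: individual events stay uninteresting, and an alert fires only when the full sequence appears for the same identity within a short window. The event schema, thresholds, and window are illustrative assumptions, not a prescription for any particular SIEM.

# Sketch: correlate failed logins, a login from a new IP, and bulk data
# access into one high-severity alert. The event schema is an assumption:
# {"time": <datetime>, "identity": str, "type": str, "ip": str}

from collections import defaultdict
from datetime import timedelta

FAILED_LOGIN_THRESHOLD = 5
WINDOW = timedelta(minutes=30)

def correlate(events: list[dict], known_ips: dict[str, set[str]]) -> list[str]:
    """Return alerts for identities that match the suspicious sequence."""
    alerts = []
    by_identity = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        by_identity[event["identity"]].append(event)

    for identity, evs in by_identity.items():
        for i, e in enumerate(evs):
            if e["type"] != "login_success":
                continue
            window_start = e["time"] - WINDOW
            failures = [x for x in evs[:i]
                        if x["type"] == "login_failure" and x["time"] >= window_start]
            new_ip = e["ip"] not in known_ips.get(identity, set())
            bulk_access = any(x["type"] == "bulk_data_access"
                              and e["time"] <= x["time"] <= e["time"] + WINDOW
                              for x in evs[i:])
            if len(failures) >= FAILED_LOGIN_THRESHOLD and new_ip and bulk_access:
                alerts.append(f"suspicious access sequence for {identity} "
                              f"from {e['ip']}")
    return alerts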

This observability model is standard for cloud security. Applying it to AI infrastructure is a matter of will, not capability.


Common Mistakes Organizations Make

Mistake 1: Treating AI Infrastructure as Special

What teams do: Create separate teams, separate processes, and separate security standards for AI infrastructure. Exempt AI workloads from standard cloud security reviews because "they're different."

Why it seems reasonable: AI systems do have different characteristics—GPU requirements, large data transfers, specialized frameworks. Teams with ML expertise are scarce. Applying standard processes feels like slowing down innovation.

Why it fails architecturally: The specialness framing becomes an excuse for skipping fundamentals. Separate processes mean separate (usually lower) security bars. The "AI is different" mindset prevents organizations from leveraging mature practices that directly apply. The container running inference is still a container. The network traffic is still network traffic. The identity model is still an identity model.

What it misses: AI infrastructure is cloud infrastructure. The differences are in degree, not kind. Organizations should extend existing security practices to AI workloads, not create parallel structures that inevitably have gaps.

Mistake 2: GPU Scarcity Driving Security Shortcuts

What teams do: Because GPU capacity is expensive and scarce, teams share GPU clusters across projects, defer patching because downtime is costly, and over-provision access so that anyone who needs compute can get it.

Why it seems reasonable: GPU time is genuinely expensive. Waiting for compute slows model development. Resource optimization is a reasonable engineering goal.

Why it fails architecturally: Scarcity-driven sharing destroys isolation. When multiple projects share GPU nodes, a compromise in one project affects all others. Deferred patching accumulates vulnerability exposure. Over-provisioned access means no one knows who's actually using the compute or for what.

What it misses: The cost of a security incident vastly exceeds the cost of additional GPU capacity or brief downtime for patching. The calculation that treats GPU scarcity as more important than security is wrong on its own terms.

Mistake 3: Assuming Managed Services Handle Security

What teams do: Deploy on managed ML platforms and assume the provider handles security. Configure default settings. Don't examine what the shared responsibility model actually covers.

Why it seems reasonable: Managed services exist to reduce operational burden. Security is part of that burden. Providers market their security certifications and compliance attestations.

Why it fails architecturally: Managed services handle infrastructure security, not workload security. They don't know what data is sensitive. They don't configure access controls for your use case. They don't segment your workloads appropriately. The shared responsibility model explicitly places these concerns on you.

What it misses: Managed services reduce some security work. They don't eliminate it. The work that remains is often the most important—data protection, access control, network configuration—and it's easier to neglect because the managed service creates an illusion of comprehensive coverage.

Mistake 4: Observability as an Afterthought

What teams do: Build AI systems first, add observability later. Deploy with default logging. Assume they'll add monitoring "once things stabilize."

Why it seems reasonable: Getting the system working is the first priority. Observability is operational polish. You can't optimize what isn't running.

Why it fails architecturally: Without observability from the start, you can't establish baselines. Without baselines, you can't detect anomalies. Without anomaly detection, incidents go undetected for weeks or months. In the healthcare scenario the anomalous traffic was at least flagged before being dismissed; most organizations wouldn't have seen it at all.

What it misses: Observability isn't polish. It's architecture. Retrofitting observability is far harder than building it in. The gap between "system works" and "system is observable" is where incidents live.

Mistake 5: Ignoring Infrastructure in Model Security Discussions

What teams do: Discuss "model security" as if it's separate from infrastructure security. Focus on prompt injection, adversarial inputs, and model extraction. Ignore the compute, storage, and network that the model depends on.

Why it seems reasonable: "Model security" is the novel part. Infrastructure security is understood. The interesting problems are at the model layer.

Why it fails architecturally: Most AI security incidents are infrastructure incidents. The healthcare scenario had nothing to do with the model itself—it was a supply chain and network segmentation failure. An attacker who can compromise infrastructure can bypass every model-level security control. Focusing on model security while ignoring infrastructure is protecting the penthouse while leaving the building unlocked.

What it misses: Infrastructure security is the foundation. Model security is built on top of it. You cannot have model security without infrastructure security. The reverse—infrastructure security without model security—at least provides a defensible position.


Architectural Questions to Ask

Identity and Access

  • Do training jobs run with dedicated workload identities, or shared service accounts?
  • Can you trace any infrastructure action to a specific human or automated identity?
  • Are service account credentials rotated automatically, and what's the rotation period?
  • Do deployed models have distinct identities, or do they share inference service credentials?
  • Can you revoke access to a specific model version without affecting other models?
  • Are there standing administrative privileges, or is privileged access just-in-time?

Why these matter: Identity is the foundation of authorization and audit. If you can't identify who did what, you can't enforce policies or investigate incidents.

Network Architecture

  • Are training, inference, and data storage on separate network segments?
  • What traffic can flow between segments, and is it explicitly allowed or implicitly permitted?
  • Can inference endpoints reach arbitrary external addresses, or is egress controlled?
  • Are GPU nodes network-isolated from each other, or can a compromised node observe peer traffic?
  • Is internal traffic encrypted, or does network segmentation provide the only protection?

Why these matter: Network segmentation limits blast radius. Flat networks turn component compromises into platform compromises.

Supply Chain

  • Do you build container images from verified base images with known provenance?
  • Are framework dependencies pinned to specific versions, and how often are they audited?
  • Can you produce a software bill of materials for any deployed model?
  • Where did your pre-trained models or model weights come from, and do you trust that source?
  • Do you scan images and dependencies for vulnerabilities before deployment?

Why these matter: Supply chain attacks are among the most effective against AI systems because dependency graphs are deep and verification is rare.

Shared Responsibility

  • For each managed service you use, can you articulate what security controls the provider handles versus what you handle?
  • Have you validated that your configuration of managed services meets your security requirements?
  • Do you review provider security bulletins and apply relevant remediations?
  • Are you monitoring the security controls you're responsible for, or assuming they're fine?

Why these matter: Misunderstanding shared responsibility is one of the most common sources of cloud security failures. AI systems add complexity but don't change the model.

Observability

  • Do your logs capture identity, not just actions?
  • Can you correlate events across training, inference, and infrastructure logs?
  • Are alerts tuned for security anomalies, not just availability?
  • Are audit logs immutable and stored outside the system being audited?
  • Can you reconstruct the full data flow for any inference request?
  • How long would it take to detect a compromised component in your AI infrastructure?

Why these matter: You can't secure what you can't see. Observability gaps are incident detection gaps.

Operational Hygiene

  • What's the patching cadence for GPU nodes and inference infrastructure?
  • Are security updates treated as urgently as feature releases?
  • Do you have runbooks for common AI infrastructure security scenarios?
  • Can you rebuild your AI infrastructure from scratch if required?
  • Are backups tested, and do they include everything needed for recovery?

Why these matter: Operational hygiene prevents the small issues that become large incidents. Deferred patching, untested backups, and missing runbooks are debts that come due during crises.


Key Takeaways

  • AI infrastructure is cloud infrastructure: The security fundamentals—identity, network segmentation, supply chain, observability—don't change because you're running ML workloads. Organizations that treat AI infrastructure as a special domain exempt from standard practices create the gaps that attackers exploit.

  • Identity failures cascade: When service accounts are over-permissioned, credentials are shared, or workloads lack distinct identities, every other security control weakens. Identity is the foundation; if it's wrong, nothing built on top can be right.

  • Network segmentation limits blast radius: Flat networks turn component compromises into platform compromises. Segmenting training, inference, and data planes—and controlling traffic between them—is basic hygiene that most AI infrastructure lacks.

  • The supply chain is deeper than you think: Container images, ML frameworks, pre-trained models, datasets—every layer has dependencies, and every dependency is an attack vector. If you can't produce a bill of materials for your AI system, you can't reason about its security.

  • Managed services shift responsibility, they don't eliminate it: Hyperscaler platforms handle infrastructure security. You handle everything else—configuration, access control, data protection, workload security. The shared responsibility model is explicit about this, but wishful thinking persists.

The core insight of this chapter connects to the lifecycle thesis: AI security starts with infrastructure security. The organizations that will secure AI systems are those that recognize there's nothing magical about GPU clusters, model registries, or inference endpoints—they're infrastructure components that require the same rigor as any other infrastructure. Those that treat AI as an exemption from standard practices will keep learning expensive lessons that cloud security solved years ago.
