Kubernetes From First Principles
A deep, interactive guide to understanding how Kubernetes actually works — from the orchestration problem it solves, through every moving part, all the way to building your own managed Kubernetes service and Cloud Controller Manager.
What Kubernetes Solves
The container orchestration problem: you've containerized your application. Now you need to run it reliably across a fleet of machines. That's harder than it sounds.
The Problem: Containers Alone Aren't Enough
Containers solved the packaging problem. With Docker, you can bundle your application and all its dependencies into a single, portable image. You can run that image on any machine that has a container runtime. But containers by themselves don't solve the operational problem.
Imagine you have a web application made up of 15 microservices, each running in a container. You have a fleet of 20 servers. Now answer these questions:
- Where does each container run? You need to decide which machine gets which container based on CPU, memory, disk, and GPU availability. This is the scheduling problem.
- What happens when a machine dies? The containers on that machine are gone. You need to detect the failure and restart those containers on healthy machines. This is the self-healing problem.
- How do containers find each other? Your frontend needs to talk to your API, which needs to talk to your database. Container IPs change every time they restart. This is the service discovery problem.
- How do you deploy a new version without downtime? You can't just stop everything and restart. You need to gradually roll out new containers while draining old ones. This is the rolling update problem.
- How do you scale up when traffic spikes? You need to spin up more container replicas and distribute traffic across them. This is the scaling problem.
- How do you manage configuration and secrets? Database passwords, API keys, feature flags — these need to be injected into containers without baking them into images. This is the configuration management problem.
- How do you handle persistent storage? Containers are ephemeral — their filesystems disappear when they stop. Databases and stateful services need storage that survives container restarts. This is the storage orchestration problem.
You could build custom scripts to solve each of these problems individually. But they interact in complex ways, and at scale, the combinatorial complexity becomes unmanageable. This is exactly the problem Kubernetes was designed to solve.
Before and After Kubernetes
Before:
- Manual placement of containers on servers via SSH
- Custom health-check scripts with cron jobs
- Hardcoded IPs and ports in config files
- Blue-green deploys with manual DNS switching
- SSH into machines to check container status
- Custom bash scripts for scaling up/down
- Different deployment process for every team
- Snowflake servers that drift from their intended state
After:
- Declare desired state in YAML; scheduler places containers automatically
- Built-in health checks with automatic restart and rescheduling
- DNS-based service discovery and load balancing
- Rolling updates with rollout status, revision history, and rollback when needed
- Unified API to inspect and manage all workloads
- Horizontal Pod Autoscaler scales based on metrics
- Standardized deployment model for every team
- Continuous reconciliation: actual state converges to desired state
A Brief History: From Borg to Kubernetes
Kubernetes didn't appear out of nowhere. It was born from over a decade of experience running containers at massive scale inside Google.
Borg is Google's internal cluster management system, in use since the mid-2000s. It manages hundreds of thousands of jobs across a fleet of machines spanning dozens of data centers. Every Google product you use — Search, Gmail, YouTube, Maps — runs on Borg. It solved scheduling, fault tolerance, service discovery, and resource management long before Docker made containers mainstream.
In 2014, Google decided to open-source a new system inspired by the lessons learned from Borg (and its successor, Omega). Three Google engineers — Joe Beda, Brendan Burns, and Craig McLuckie — started the Kubernetes project. The name comes from the Greek word for "helmsman" or "pilot."
The key insight behind making it open-source was strategic: Google believed that if the industry standardized on an orchestration platform, it would commoditize infrastructure and reduce the competitive advantage of AWS (which at the time was locking customers in with proprietary services). By giving away the orchestration layer, Google made it easier for workloads to move between clouds.
In 2015, Kubernetes 1.0 was released and Google donated the project to the newly formed Cloud Native Computing Foundation (CNCF). Since then, it has become the de facto standard for container orchestration, with every major cloud provider offering a managed Kubernetes service.
Kubernetes is fundamentally a declarative, desired-state system. You tell it what you want (e.g., "run 3 replicas of my web server"), and its job is to continuously make reality match your declaration. If a replica crashes, it creates a new one. If a node goes down, it reschedules pods to healthy nodes. This reconciliation-loop architecture is the single most important concept in Kubernetes.
Core Concepts
Most of Kubernetes is driven by API objects persisted through the API server and backed by etcd. Understanding those objects and how they relate to each other is the foundation for everything else.
The API Object Model
Kubernetes has a single, unified API. Most cluster resources — Pods, Services, network policies, storage volumes, and more — are represented as API objects. Containers themselves are runtime processes on nodes, described through Pod specs and reported through Pod status. Every object has four key parts:
- apiVersion — which API group and version this object belongs to (e.g., apps/v1, v1)
- kind — what type of object this is (e.g., Pod, Deployment, Service)
- metadata — identifying information: name, namespace, labels, annotations, UID
- spec — your desired state: what you want this object to look like
Most objects also have a status field, which is the actual current state as observed by Kubernetes. Controllers work to make the real world match spec, while status records what they currently observe.
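The four-part shape plus status can be sketched as a plain data structure. This is an illustrative Python sketch (the names "web" and the replica counts are made up), showing how the gap between spec and status is what controllers work to close:

```python
# A minimal API object sketch (hypothetical nginx-style Deployment) showing
# the user-supplied parts plus the system-populated status field.
deployment = {
    "apiVersion": "apps/v1",          # API group/version
    "kind": "Deployment",             # object type
    "metadata": {"name": "web", "namespace": "default",
                 "labels": {"app": "web"}},
    "spec": {"replicas": 3,           # desired state, written by the user
             "selector": {"matchLabels": {"app": "web"}}},
    "status": {"readyReplicas": 2},   # observed state, written by controllers
}

def drift(obj):
    """How far actual state lags desired state (replicas still missing)."""
    return obj["spec"]["replicas"] - obj["status"].get("readyReplicas", 0)

print(drift(deployment))  # 1: one replica short of desired
```

A controller's whole job is to drive this drift to zero and keep it there.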
The Essential Objects
Pod
The smallest deployable unit. A Pod wraps one or more containers that share the same network namespace (same IP address), the same storage volumes, and the same lifecycle. They're co-scheduled onto the same machine. Think of a Pod as a logical host — the containers inside it are like processes running on the same machine. Pods are ephemeral: they're created, they run, and when they die, they are not resurrected. Instead, a controller creates a replacement.
ReplicaSet
Ensures that a specified number of identical Pod replicas are running at any given time. If a Pod dies, the ReplicaSet controller notices (via the API server's watch mechanism) and creates a replacement. If there are too many pods, it terminates extras. You rarely create ReplicaSets directly — Deployments manage them for you — but understanding them is key because they're the mechanism that actually maintains your desired replica count.
Deployment
The most common way to run stateless applications. A Deployment manages ReplicaSets, which in turn manage Pods. The key value of a Deployment is declarative rolling updates: when you change the Pod template (e.g., new container image), the Deployment creates a new ReplicaSet, gradually scales it up, and scales the old ReplicaSet down. If something goes wrong, you can roll back to the previous ReplicaSet. The Deployment controller handles the rollout mechanics automatically; rollback is usually triggered explicitly or by higher-level automation.
Service
A stable network endpoint for accessing a set of Pods. Pods are ephemeral and their IPs change — a Service provides a fixed IP (ClusterIP) and DNS name that automatically routes traffic to healthy pods matching a label selector. Services come in several types: ClusterIP (internal only), NodePort (exposes on every node's IP), LoadBalancer (provisions a cloud load balancer), and ExternalName (DNS alias).
Namespace
A virtual partition within a cluster. Namespaces provide scope for names (two Pods can have the same name in different namespaces), a unit for access control (RBAC policies can be namespace-scoped), and a unit for resource quotas. Common namespaces: default, kube-system (for system components), kube-public. In multi-tenant environments, each team or environment often gets its own namespace.
ConfigMap & Secret
ConfigMaps hold non-sensitive configuration data as key-value pairs. Secrets hold sensitive data (passwords, tokens, certificates), stored base64-encoded (not encrypted by default — you need to enable encryption at rest). Both can be mounted as files inside a Pod or exposed as environment variables. This separates configuration from container images, so you can use the same image across dev, staging, and production.
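Since base64 is an encoding, not encryption, anyone who can read a Secret object can recover the plaintext. A short Python sketch makes the point:

```python
import base64

# Secret values are stored base64-encoded. Encoding is trivially reversible,
# which is why encryption at rest must be enabled separately.
password = "s3cr3t"
encoded = base64.b64encode(password.encode()).decode()  # what gets stored
decoded = base64.b64decode(encoded).decode()            # what the Pod sees
print(encoded, decoded)  # czNjcjN0 s3cr3t
```

This is why RBAC restrictions on who can read Secrets matter as much as etcd encryption at rest.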
StatefulSet
Like a Deployment, but for stateful applications (databases, message queues, distributed stores). StatefulSets provide stable, unique network identifiers (pod-0, pod-1, pod-2), stable persistent storage (each pod gets its own PersistentVolumeClaim that survives rescheduling), and ordered, graceful deployment and scaling. The pods are created in order (0, then 1, then 2) and terminated in reverse order.
DaemonSet
Ensures that a copy of a Pod runs on every node in the cluster (or a subset of nodes). Ideal for node-level infrastructure: log collectors (Fluentd), monitoring agents (Prometheus node-exporter), network plugins (CNI), storage plugins (CSI node driver). When a new node joins the cluster, the DaemonSet controller automatically schedules a Pod onto it.
Labels, Selectors, and the Loose-Coupling Model
Kubernetes objects are connected to each other not by direct references, but by labels and selectors. A label is a key-value pair attached to an object (e.g., app: web, tier: frontend, version: v2). A selector is a query that matches objects by their labels.
This is how a Service knows which Pods to route traffic to: it has a selector like app: web, and it automatically discovers all Pods with that label. This is how a ReplicaSet knows which Pods it owns. This is how network policies select which Pods they apply to. Labels are the glue of Kubernetes.
This design is intentional: it enables loose coupling. A Service doesn't need to know the names or IPs of individual Pods. It just says "give me everything labeled app: web." Pods can come and go, and the Service automatically adapts. This loose coupling is what makes Kubernetes resilient and flexible.
The ownership hierarchy in Kubernetes goes: Deployment → ReplicaSet → Pod. The Deployment controller creates/manages ReplicaSets. The ReplicaSet controller creates/manages Pods. Each level watches the level below and reconciles. This layered controller model is how Kubernetes implements complex behaviors (like rolling updates) from simple primitives.
The Deployment → ReplicaSet → Pod Relationship
Architecture Overview
A Kubernetes cluster is split into two planes: the control plane (the brain) and the data plane (the muscle). Understanding this split is essential.
The Two Planes
Every Kubernetes cluster has a control plane and a data plane. The control plane makes decisions about the cluster: scheduling, detecting failures, responding to events. The data plane runs the actual workloads: your application containers.
In a self-managed cluster, the control plane components typically run on dedicated control plane nodes (usually 3 for high availability). In a managed Kubernetes service (like EKS, GKE, or the one you're building), the cloud provider operates the control plane for you — but users still interact with it through the Kubernetes API endpoint while running workloads on the data plane nodes.
Control Plane Components at a Glance
| Component | What It Does | Talks To |
|---|---|---|
| API Server | The front door. Every interaction with the cluster goes through the API server's REST API. It authenticates, authorizes, validates, and persists API objects to etcd. It also serves as the pub-sub hub — controllers watch the API server for changes. | etcd (read/write), all other components (they all talk to API server) |
| etcd | Distributed key-value store that holds the entire cluster state. Every API object is serialized and stored here. Uses the Raft consensus algorithm for leader election and data replication across 3 or 5 nodes. | API Server only (nothing else should touch etcd directly) |
| Scheduler | Watches for newly created Pods with no assigned node. Evaluates each node's fitness (CPU, memory, affinity rules, taints) and picks the best one. Writes the node assignment back to the API server. | API Server (watch for unscheduled pods, write node binding) |
| Controller Manager | Runs the core control loops: ReplicaSet controller, Deployment controller, Node controller, Job controller, EndpointSlice controller, etc. Each loop watches for specific API objects and takes action to reconcile desired state with actual state. | API Server (watch objects, update status) |
| Cloud Controller Manager | Runs cloud-specific control loops: provisioning load balancers, managing node lifecycle (detecting deleted VMs), configuring routes. This is the component you'll be implementing for your cloud. | API Server (watch objects), Cloud APIs (provision resources) |
Data Plane Components at a Glance
| Component | What It Does | Talks To |
|---|---|---|
| Kubelet | The node agent. Runs on every worker node. Watches the API server for Pods assigned to its node. Manages the full pod lifecycle: pulls images, creates containers (via CRI), runs health checks, reports status back. Also handles volume mounting and secret injection. | API Server (watch pods, report status), Container Runtime (via CRI), CSI drivers (via CSI) |
| kube-proxy | Manages network rules on each node to implement Service abstraction. Watches for Service and EndpointSlice objects and programs iptables/IPVS rules so that traffic to a Service ClusterIP gets forwarded to a healthy backend Pod. | API Server (watch Services and EndpointSlices), Node's network stack (iptables/IPVS) |
| Container Runtime | The software that actually runs containers. Kubernetes talks to it via the Container Runtime Interface (CRI). Common runtimes: containerd, CRI-O. The runtime pulls images from registries and creates/manages the actual Linux containers (using runc or similar OCI runtime under the hood). | Kubelet (via CRI gRPC), OCI runtime (runc), Image registries |
The Reconciliation Loop: Kubernetes' Core Pattern
The single most important architectural pattern in Kubernetes is the reconciliation loop (also called the "control loop" or "observe-diff-act" loop). Every controller in Kubernetes follows this exact pattern:
- Observe: Watch the API server for the current state of the objects you care about (e.g., "how many Pods exist with label app=web?")
- Diff: Compare the current state against the desired state (e.g., "the Deployment spec says 3 replicas, but only 2 exist")
- Act: Take the minimum action needed to drive current state toward desired state (e.g., "create 1 new Pod")
- Repeat: Go back to step 1 and do it again. Continuously. Forever.
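The steps above can be sketched as a single function. This is a toy "run N replicas" reconciler, not real controller code; observe/act are stand-ins for API server calls:

```python
# One observe-diff-act pass: given desired replicas and the pods observed
# to exist, return the minimal actions needed to converge.
def reconcile(desired_replicas, current_pods):
    diff = desired_replicas - len(current_pods)
    if diff > 0:
        return [("create", i) for i in range(diff)]          # too few: create
    if diff < 0:
        return [("delete", p) for p in current_pods[diff:]]  # too many: delete
    return []                                                # converged: no-op

print(reconcile(3, ["web-a", "web-b"]))    # one pod short: create one
print(reconcile(3, ["a", "b", "c", "d"]))  # one pod over: delete one
```

Run in a loop forever, this tiny function is self-healing: whatever deleted or crashed a pod, the next pass recreates it.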
This pattern makes Kubernetes self-healing by design. If a Pod crashes, a node goes down, or someone manually deletes a resource, the relevant controller will detect the drift and correct it. There's no central orchestrator issuing commands — instead, many independent controllers each manage their own slice of the world, all converging toward the declared desired state.
When you build a Cloud Controller Manager, you are writing controllers that follow this exact same pattern. Your node controller watches Node objects and reconciles them against your cloud's VM inventory. Your service controller watches Services and reconciles them against your cloud's load balancers. Understanding this pattern deeply is the key to building a correct CCM.
API Server & etcd
The API server is the brain of Kubernetes — every interaction flows through it. etcd is the memory — it stores the entire cluster state. Together, they form the foundation.
The API Server: The Single Source of Truth
The kube-apiserver is the only component that talks directly to etcd. Every other component — the scheduler, controllers, kubelet, even kubectl — interacts with the cluster exclusively through the API server's REST API.
This design is intentional and important:
- Single point of serialization: All reads and writes to cluster state go through one gateway, making it possible to enforce authentication, authorization, validation, and admission control consistently.
- Watch mechanism: The API server implements an efficient event-streaming mechanism. Components can "watch" for changes to specific resource types and get notified in near-real-time when objects are created, updated, or deleted. This is how controllers know when to reconcile.
- Optimistic concurrency: Every object has a resourceVersion field. When you update an object, you must include the resourceVersion you read. If someone else modified the object in between, the update is rejected with a 409 Conflict. This prevents lost updates without locking.
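The optimistic-concurrency check can be modeled in a few lines. This is a toy in-memory store, not the API server's implementation, but the compare-and-bump logic is the same idea:

```python
class Conflict(Exception):
    """Stand-in for an HTTP 409 Conflict response."""

class Store:
    """Toy model of the API server's optimistic-concurrency check."""
    def __init__(self, obj):
        self.obj, self.rv = obj, 1

    def get(self):
        return dict(self.obj), self.rv

    def update(self, obj, rv):
        if rv != self.rv:                 # someone wrote in between: reject
            raise Conflict(f"stale resourceVersion {rv}, current {self.rv}")
        self.obj, self.rv = obj, self.rv + 1
        return self.rv

store = Store({"replicas": 3})
obj, rv = store.get()
store.update({"replicas": 4}, rv)         # succeeds, rv advances to 2
try:
    store.update({"replicas": 5}, rv)     # reuses the stale rv: rejected
except Conflict as e:
    print("409:", e)
```

On a 409, a well-behaved client re-reads the object, re-applies its change, and retries.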
Authentication & Authorization
Every request to the API server is first authenticated (who are you?) and then authorized (can you do this?). Kubernetes supports multiple authentication methods: X.509 client certificates, bearer tokens, OIDC tokens, webhook token review, and service account tokens (used by pods to talk to the API server).
Authorization is handled by RBAC (Role-Based Access Control) in most clusters. You define Roles (what operations are allowed on which resources) and bind them to users or service accounts with RoleBindings. RBAC is namespace-scoped (Roles and RoleBindings) or cluster-wide (ClusterRoles and ClusterRoleBindings).
Admission Controllers
After authentication and authorization, the request passes through admission controllers. These are plugins that can mutate (modify the request) or validate (accept/reject the request) API objects. There are two phases:
- Mutating admission: Can modify the object before it's persisted. Example: injecting default resource limits, adding sidecar containers (Istio does this), setting default storage classes.
- Validating admission: Can only accept or reject. Example: enforcing that all containers must have resource limits, rejecting pods that try to run as root.
Both phases support webhooks: you can run your own admission logic as an external HTTP server, and the API server will call it for every relevant request. This is extremely powerful for enforcing custom policies.
The Request Flow Through the API Server
A write request (e.g., from kubectl apply) travels through authentication, authorization, mutating admission, schema validation, and validating admission; only then is the object persisted to etcd.
etcd: The Cluster's Memory
etcd is a distributed, strongly consistent key-value store. It's the authoritative source of truth for the entire cluster state. Every API object — every Pod, Deployment, Service, Secret, ConfigMap, and CustomResourceDefinition — is stored in etcd as a key-value pair.
How Data is Stored
Objects are stored under a key hierarchy. For example, a Pod named "web-1" in namespace "production" is stored at: /registry/pods/production/web-1. The value is the serialized API object (Protocol Buffers by default, though JSON is also possible). The API server abstracts this away — clients never need to know etcd's key layout.
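The key layout and prefix queries can be sketched in Python. The exact layout shown is illustrative; clients only ever see the API server's REST paths, never these keys:

```python
def registry_key(resource, namespace, name):
    """Approximate etcd key layout used by the API server (illustrative)."""
    return f"/registry/{resource}/{namespace}/{name}"

# A toy key-value store standing in for etcd.
store = {
    registry_key("pods", "production", "web-1"): "<serialized Pod>",
    registry_key("pods", "production", "web-2"): "<serialized Pod>",
    registry_key("pods", "staging", "web-1"): "<serialized Pod>",
}

def list_prefix(prefix):
    """Prefix scans are how 'list all Pods in a namespace' is served."""
    return sorted(k for k in store if k.startswith(prefix))

print(list_prefix("/registry/pods/production/"))
```

Hierarchical keys make both namespace-scoped and cluster-wide listings a single prefix scan.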
Raft Consensus
etcd uses the Raft consensus algorithm to replicate data across its cluster members. Here's how it works in simplified terms:
- Leader election: One etcd node is the leader at any time. All writes go through the leader. If the leader fails, the remaining nodes hold an election to choose a new one. This election completes in milliseconds.
- Log replication: When the leader receives a write, it appends it to its log and sends it to followers. Once a majority (quorum) of nodes have acknowledged the write, it's considered committed.
- Quorum: With 3 nodes, quorum is 2 (can tolerate 1 failure). With 5 nodes, quorum is 3 (can tolerate 2 failures). This is why etcd clusters are always odd-numbered.
- Strong consistency: Any read from the leader (or a linearizable read from a follower) returns the latest committed data. There is no "eventual consistency" in etcd — if a write is acknowledged, subsequent reads will see it.
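The quorum arithmetic above explains the odd-numbered sizing. A tiny sketch:

```python
def quorum(n):
    """Majority needed to commit a write in an n-member etcd cluster."""
    return n // 2 + 1

def tolerated_failures(n):
    """Members that can fail while the cluster still commits writes."""
    return n - quorum(n)

for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# Note that 4 members tolerate the same single failure as 3, while adding a
# member whose vote is needed for quorum — which is why clusters stay odd.
```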
The Watch Mechanism
etcd provides an efficient watch API. Clients can open a long-lived gRPC stream and receive notifications whenever keys matching a prefix are modified. The API server uses this to implement its own watch mechanism: controllers watch the API server, and the API server watches etcd. This creates a near-real-time event notification system without polling.
Watches are also resumable: if the connection drops, the client can reconnect and specify a resourceVersion to resume from, so it doesn't miss any events.
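Resumability falls out of the fact that every event carries a resourceVersion. A toy sketch (the event list and replay function stand in for the real gRPC stream):

```python
# Events carry monotonically increasing resourceVersions; after a disconnect
# the client resumes from the last version it processed.
events = [(1, "ADDED", "web-1"), (2, "MODIFIED", "web-1"),
          (3, "ADDED", "web-2"), (4, "DELETED", "web-1")]

def watch(since_rv):
    """Replay every event newer than since_rv (stand-in for a watch stream)."""
    return [e for e in events if e[0] > since_rv]

first = watch(0)[:2]        # client processes two events, then disconnects
last_seen = first[-1][0]    # remembers resourceVersion 2
resumed = watch(last_seen)  # on reconnect: nothing missed, nothing repeated
print(resumed)              # [(3, 'ADDED', 'web-2'), (4, 'DELETED', 'web-1')]
```

In practice the server only retains a bounded event history, so a client resuming from a too-old resourceVersion must fall back to a full re-list.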
etcd is the most critical component in a Kubernetes cluster. If etcd data is lost, the entire cluster state is lost — Kubernetes literally doesn't know what Pods should be running, what Services exist, or what the desired state is. etcd backups are non-negotiable in production. For a managed Kubernetes service, your etcd backup and restore strategy is one of the most important design decisions you'll make.
Scheduler & Controllers
The scheduler decides where Pods run. The controller manager keeps reality in sync with your declarations. Together, they are the decision-making machinery of Kubernetes.
The Scheduler: How Pods Get Placed on Nodes
When a Pod is first created (e.g., by a ReplicaSet controller), it has no spec.nodeName set. The scheduler watches for these "unscheduled" Pods and assigns each one to a suitable node. The process has two phases:
Phase 1: Filtering (Predicates)
The scheduler eliminates nodes that cannot run the Pod. Each filter plugin returns "fits" or "doesn't fit." A node must pass all filters to remain a candidate. Common filters:
- NodeResourcesFit: Does the node have enough allocatable CPU and memory to satisfy the Pod's resource requests?
- NodeAffinity: Does the node match the Pod's nodeAffinity rules (e.g., "only run on nodes with GPU=true label")?
- TaintToleration: Does the Pod tolerate the node's taints? Taints are "repel rules" on nodes — a node tainted with gpu=true:NoSchedule will reject pods that don't explicitly tolerate that taint.
- PodTopologySpread: Does scheduling here violate the Pod's spread constraints (e.g., "spread evenly across availability zones")?
- NodePorts: If the Pod uses hostPort, is that port available on this node?
- VolumeBinding: Can the Pod's requested PersistentVolumeClaims be satisfied on this node?
Phase 2: Scoring (Priorities)
Among the nodes that passed filtering, the scheduler ranks them. Each scoring plugin gives each node a score from 0 to 100. The scores are weighted and summed. The node with the highest total score wins. Common scoring plugins and strategies:
- NodeResourcesFit: Can favor emptier nodes to spread load, or fuller nodes to bin-pack workloads and reduce the number of machines in use.
- InterPodAffinity: Scores based on whether co-located pods would satisfy affinity or violate anti-affinity rules.
- PodTopologySpread: Prefers placements that keep replicas balanced across zones, nodes, or other topology domains.
- ImageLocality: Prefers nodes that already have the required container images cached locally (avoids image pull latency).
- Preferred affinity and taint handling: Soft preferences such as preferred node affinity and tolerated taints can influence the final ranking without acting as hard filters.
The Controller Manager: Reconciliation at Scale
The kube-controller-manager is a single binary that runs dozens of independent control loops. Each loop is responsible for a specific type of API object and drives the cluster toward the desired state for that object type.
How Controllers Work Internally
Each controller uses a pattern called Informer + Work Queue:
- Informer: A client-side cache that watches the API server via a long-lived connection. When an object is created, updated, or deleted, the Informer receives the event and updates its local cache. It also calls registered event handlers.
- Event handler: On receiving an event (add/update/delete), the handler extracts the object's key (namespace/name) and adds it to a work queue. It does not do any work here — just queues the item.
- Work queue: A rate-limited, deduplicating queue. If multiple events arrive for the same object before the worker processes it, only one reconciliation runs. This prevents thundering herd problems.
- Worker: Pulls items from the work queue and calls the Reconcile() function. This function reads the object from the Informer cache, compares desired state with actual state, and takes action (create/update/delete sub-resources). If the reconciliation fails, the item is re-queued with exponential backoff.
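The deduplicating queue is the piece that prevents redundant reconciles. A minimal sketch (rate limiting omitted for brevity):

```python
from collections import deque

class WorkQueue:
    """Deduplicating queue: an item queued many times is processed once."""
    def __init__(self):
        self._order, self._pending = deque(), set()

    def add(self, key):
        if key not in self._pending:   # collapse duplicate events for a key
            self._pending.add(key)
            self._order.append(key)

    def get(self):
        key = self._order.popleft()
        self._pending.discard(key)     # key may now be re-queued
        return key

    def __len__(self):
        return len(self._order)

q = WorkQueue()
for event_key in ["default/web", "default/web", "default/api", "default/web"]:
    q.add(event_key)                   # three events for default/web arrive...
print(len(q), q.get(), q.get())       # ...but only one queued entry results
```

This works because Reconcile() reads the latest state from the Informer cache anyway; processing the key once covers all the events that were collapsed into it.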
Key Controllers in kube-controller-manager
| Controller | Watches | Does |
|---|---|---|
| Deployment | Deployment, ReplicaSet | Creates new ReplicaSets on spec change, manages rollout (scale up new, scale down old), supports pause/resume/rollback |
| ReplicaSet | ReplicaSet, Pod | Ensures the exact number of Pod replicas. Creates or deletes Pods as needed. |
| Node | Node | Monitors node health via heartbeats. Sets NodeConditions, applies node-lifecycle taints, and eventually evicts pods from unhealthy nodes after grace and toleration periods. |
| EndpointSlice | Service, Pod | Maintains the set of IP addresses for Pods backing each Service. Updates when Pods are added/removed/become (un)ready. |
| Job | Job, Pod | Manages batch workloads. Creates Pods to run tasks to completion. Tracks success/failure, supports parallelism and retries. |
| ServiceAccount | Namespace | Creates a "default" ServiceAccount in every new namespace. |
| Namespace | Namespace | Handles namespace deletion: deletes all objects in the namespace before finalizing. |
| PersistentVolume | PVC, PV | Binds PersistentVolumeClaims to available PersistentVolumes. In modern clusters, dynamic provisioning is commonly performed by external CSI provisioners. |
Leader Election
In a highly available (HA) setup, you run multiple replicas of the controller manager, but only one should be actively reconciling at a time to avoid conflicts. Kubernetes uses leader election via a Lease object stored through the API server: one instance holds the lease and acts as leader while the others stand by. If the leader dies, a standby instance acquires the lease within seconds. This same pattern applies to the scheduler and the CCM.
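The lease mechanics can be modeled in a few lines. This toy version ignores the resourceVersion-based compare-and-swap the real implementation relies on, and the instance names are invented:

```python
class Lease:
    """Toy lease: the holder must renew within `duration` or lose leadership."""
    def __init__(self, duration):
        self.holder, self.renewed, self.duration = None, 0.0, duration

    def try_acquire(self, candidate, now):
        expired = now - self.renewed > self.duration
        if self.holder is None or expired or self.holder == candidate:
            self.holder, self.renewed = candidate, now   # acquire or renew
            return True
        return False   # lease held by a live leader: stay on standby

lease = Lease(duration=15.0)
assert lease.try_acquire("ccm-0", now=0.0)       # ccm-0 becomes leader
assert not lease.try_acquire("ccm-1", now=5.0)   # standby is blocked
assert lease.try_acquire("ccm-1", now=20.0)      # lease expired: failover
print(lease.holder)                              # ccm-1
```

The real implementation guards the acquire/renew write with optimistic concurrency, so two standbys racing for an expired lease cannot both win.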
Kubelet & Container Runtime
The kubelet is the node-level agent that actually makes containers run. It bridges the gap between the control plane's desired state and the reality on each machine.
Kubelet: The Node Agent
The kubelet runs on every node in the cluster. It is responsible for:
- Pod lifecycle management: Watches the API server for Pods assigned to its node (via spec.nodeName). Creates, starts, stops, and restarts containers. Handles init containers, sidecar containers, and ephemeral containers.
- Health checking: Runs liveness, readiness, and startup probes. Restarts containers that fail liveness probes. Removes Pods from Service endpoints when readiness probes fail.
- Resource management: Enforces CPU and memory limits via cgroups. Evicts Pods when the node runs out of disk or memory (eviction thresholds are configurable).
- Volume management: Mounts ConfigMaps, Secrets, PersistentVolumeClaims, and projected volumes into Pod containers.
- Status reporting: Periodically updates the Node object's status (allocatable resources, conditions, addresses) and each Pod's status (phase, container states, IPs) in the API server.
- Node heartbeats: Sends periodic heartbeats (Lease objects) to the API server to signal that the node is alive. The node controller in the control plane uses these to detect node failures.
Pod Startup Sequence
When the kubelet picks up a new Pod, here's what happens in order:
Admit the Pod
The kubelet runs admission checks: are there enough resources? Does the Pod's security context match the node's policies? If admission fails, the Pod is rejected.
Create the Pod Sandbox
The kubelet asks the container runtime (via CRI) to create a "sandbox" — this is the shared network namespace for all containers in the Pod. The CNI plugin is invoked here to set up networking: allocate an IP, configure interfaces, set up routes.
Pull Container Images
For each container in the Pod spec, the kubelet instructs the runtime to pull the image (if not already cached). The imagePullPolicy determines whether to always pull, never pull, or pull only if not present.
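The pull decision reduces to a small function of the policy and the local image cache. A simplified sketch (the real kubelet also considers image digests and tag conventions):

```python
def should_pull(policy, image_cached):
    """Simplified kubelet pull decision for the three imagePullPolicy values."""
    if policy == "Always":
        return True                 # always go to the registry
    if policy == "Never":
        return False                # fail if the image is not already local
    return not image_cached         # "IfNotPresent": pull only on cache miss

print(should_pull("IfNotPresent", image_cached=True))   # False: use cache
print(should_pull("Always", image_cached=True))         # True: pull anyway
```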
Run Init Containers
Init containers run sequentially, in order, before any app containers start. Each must complete successfully before the next starts. They're used for setup tasks: database migrations, config file generation, waiting for dependencies.
Start App Containers
All regular containers in the Pod start in parallel. The kubelet calls the CRI to create and start each container. Volumes are mounted, environment variables are injected, and the container's entrypoint is executed.
Run Probes
The startup probe runs first (if configured), giving the container time to initialize. Once the startup probe passes, liveness and readiness probes begin running on their configured intervals.
CRI: Container Runtime Interface
Kubernetes doesn't run containers directly. It delegates to a container runtime via a gRPC interface called the Container Runtime Interface (CRI). This abstraction means Kubernetes can work with any CRI-compliant runtime.
The CRI has two main services: RuntimeService (create/start/stop/remove containers and sandboxes) and ImageService (pull/list/remove images). The kubelet calls these gRPC methods, and the runtime handles the low-level details of creating Linux namespaces, setting up cgroups, mounting filesystems, and executing the container process.
What Happens Under the Hood: Containers are Linux Processes
A container is not a VM. It's a regular Linux process that runs in isolation using two kernel features:
- Namespaces: Provide isolation. A container commonly gets its own PID namespace (can't see host processes), network namespace (own IP stack), mount namespace (own filesystem), IPC namespace, and UTS namespace (own hostname). User namespaces exist too, but they are optional and not universal in Kubernetes deployments.
- cgroups (Control Groups): Limit resources. A container's cgroup restricts how much CPU and memory the process can use, and can also participate in other kernel-level resource controls. This is the core mechanism behind Kubernetes resources.limits.
When you set resources.limits.memory: 256Mi in a Pod spec, the kubelet tells the container runtime to create a cgroup with a memory limit of 256 MiB. If the process exceeds this, the Linux kernel's OOM killer terminates it, and the kubelet restarts the container.
What Happens When a Node Dies
In a cloud environment, the Cloud Controller Manager's node controller plays a critical role here. When the CCM detects that a VM has been terminated by the cloud provider (e.g., the user deleted it, or a spot instance was reclaimed), it can immediately remove the Node object instead of waiting for heartbeat timeouts and toleration-based eviction delays. This can reduce recovery time from minutes down to seconds.
Networking & Services
Kubernetes networking is one of the most complex and important topics. Every Pod gets its own IP. Every Service gets a stable endpoint. Here's how it all works.
The Kubernetes Networking Model
Kubernetes imposes a few fundamental networking rules. These are non-negotiable — every network implementation must satisfy them:
- Every Pod gets a unique IP address. No two Pods in the cluster share an IP.
- All Pods can communicate with all other Pods without NAT. If Pod A has IP 10.244.1.5 and Pod B has IP 10.244.2.8, Pod A can reach Pod B by simply connecting to 10.244.2.8. No port mapping, no NAT translation. This massively simplifies application networking.
- All Nodes can communicate with all Pods (and vice versa) without NAT.
- The IP that a Pod sees for itself is the same IP that others see for it. No hidden NAT that could confuse applications.
These rules define what is needed, but not how to implement it. The implementation is delegated to CNI plugins.
CNI: Container Network Interface
CNI is a specification for configuring network interfaces in Linux containers. When a Pod is created, the kubelet calls the CNI plugin to:
- Create a virtual ethernet (veth) pair: one end in the Pod's network namespace, one end on the host
- Assign an IP address to the Pod from the node's Pod CIDR range
- Set up routes so traffic to/from the Pod is properly forwarded
- Configure any network policies (firewall rules)
Popular CNI plugins include Calico (BGP-based, supports network policies), Cilium (eBPF-based, high performance), Flannel (simple overlay network), and AWS VPC CNI (assigns real VPC IPs to Pods). For your managed K8s service, your choice of CNI plugin (or building your own) is a critical architectural decision — it determines how Pod IPs relate to your cloud's VPC networking.
Services: Stable Endpoints for Pods
Pods are ephemeral — they get created and destroyed constantly, and their IPs change each time. A Service provides a stable, load-balanced endpoint that always routes to the right Pods.
Service Types
| Type | How It Works | Use Case |
|---|---|---|
| ClusterIP | Assigns a virtual IP from the Service CIDR range (e.g., 10.96.0.0/16). This IP is only reachable from within the cluster. kube-proxy programs iptables/IPVS rules on every node to NAT traffic from the ClusterIP to a backend Pod IP. | Internal communication between microservices |
| NodePort | Builds on ClusterIP. Additionally opens a static port (30000-32767) on every node's IP. External traffic to <nodeIP>:<nodePort> is forwarded to the ClusterIP, then to a backend Pod. Every node listens on this port, regardless of whether it runs a backend Pod. | Simple external access without a cloud load balancer |
| LoadBalancer | Often builds on NodePort, but not always. It provisions an external load balancer from the cloud provider (this is where your CCM comes in). Some implementations route traffic to node ports, while others target endpoints or Pod IPs more directly. The Service's status.loadBalancer.ingress is updated with the LB's external IP or hostname. | Production external access with cloud-native load balancing |
| ExternalName | Does not create a ClusterIP or proxy rules. Instead, creates a DNS CNAME record that maps the Service name to an external hostname (e.g., my-db.rds.amazonaws.com). | Referencing external services by a stable internal DNS name |
kube-proxy: The Network Plumber
kube-proxy runs on every node and implements the Service abstraction. It watches the API server for Service and EndpointSlice objects and programs the node's network stack accordingly. It commonly operates in iptables or IPVS modes, though some environments use newer dataplanes or replace kube-proxy entirely:
- iptables mode (default): Creates iptables rules in the PREROUTING and OUTPUT chains. When a packet is destined for a ClusterIP, iptables performs DNAT (destination NAT) to rewrite the destination IP to a randomly selected backend Pod IP. This happens entirely in the kernel — no userspace proxy involved.
- IPVS mode: Uses Linux IPVS (IP Virtual Server) instead of iptables. IPVS is a transport-layer load balancer built into the kernel. It's more efficient for clusters with thousands of Services because it uses hash tables instead of sequential iptables rules. Supports multiple load balancing algorithms: round-robin, least connections, source hash, etc.
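The "randomly selected backend" in iptables mode uses the iptables statistic module: for n endpoints, rule i (0-indexed) fires with probability 1/(n-i), and the last rule is unconditional, which works out to a uniform 1/n chance per backend. A small Go sketch of that probability chain:

```go
package main

import "fmt"

// statisticProbabilities returns the per-rule match probabilities
// kube-proxy's iptables mode programs for n endpoints: rule i fires
// with probability 1/(n-i), and the final rule always matches.
// Combined, every backend is selected with equal probability 1/n.
func statisticProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := 0; i < n; i++ {
		probs[i] = 1.0 / float64(n-i)
	}
	return probs
}

func main() {
	// Three backends: rules fire with p = 1/3, 1/2, 1.
	// Overall selection: 1/3, (2/3)*(1/2) = 1/3, (2/3)*(1/2) = 1/3.
	fmt.Println(statisticProbabilities(3))
}
```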
How a Service Routes Traffic
DNS: Service Discovery
Kubernetes runs an internal DNS server (CoreDNS) as a Deployment in the kube-system namespace. Every Service automatically gets a DNS record:
- <service-name>.<namespace>.svc.cluster.local → ClusterIP (A record)
- For headless Services (no ClusterIP): DNS returns the individual Pod IPs directly
- For ExternalName Services: DNS returns a CNAME to the external hostname
Every Pod's /etc/resolv.conf is configured to use CoreDNS as its nameserver, with search domains so you can refer to services in the same namespace by just their name (e.g., curl http://web/).
Managed Kubernetes Architecture
How cloud providers like AWS, Google, and Azure run Kubernetes as a service — and the architectural decisions you'll face building your own.
What "Managed" Means
In a managed Kubernetes service, the cloud provider runs the control plane (API server, etcd, scheduler, controller manager, CCM) on behalf of the customer. The customer still uses that control plane through the Kubernetes API, but usually only manages the worker nodes (or in serverless-style offerings like EKS Fargate or GKE Autopilot, not even that).
The value proposition is clear: customers get a production-grade Kubernetes API without having to operate etcd, handle control plane upgrades, manage certificates, or worry about API server availability. They interact with Kubernetes the same way — kubectl apply works identically — but the operational burden of the control plane is entirely on the cloud provider.
The Key Architectural Decisions
Control Plane Isolation
- Dedicated: Each cluster gets its own control plane VMs (API server, etcd instances). Full isolation, but expensive. Services like EKS and GKE generally present a separate control plane per cluster, even though the provider's internal implementation details are abstracted.
- Shared: Multiple customers or clusters share control plane infrastructure (e.g., shared API servers with strong logical isolation, or virtual clusters). More cost-efficient, but requires strong multi-tenancy isolation.

For a new service, dedicated control planes are simpler and safer to start with.
etcd Topology
Where does etcd run? Options: (a) co-located with the API server on the same VM, (b) on separate dedicated VMs, (c) as a managed service. Separate etcd is better for isolation and performance. You also need to decide on cluster size (3 or 5 nodes), backup strategy (continuous snapshots to object storage), and restore procedures.
Network Architecture
How does the control plane communicate with worker nodes? Typically, the API server is exposed via a load balancer, and kubelets on worker nodes connect outbound to it. For private clusters, you need a VPN or private link between the control plane VPC and the customer's VPC. Your CNI plugin must integrate with your cloud's VPC networking.
Node Provisioning
How are worker nodes created? Typically, the managed service provides an API to create "node pools" (groups of VMs with the same configuration). The VMs are provisioned with a bootstrap script that installs kubelet, configures it with the API server endpoint and authentication token, and joins the cluster.
Upgrade Strategy
How do you upgrade the control plane Kubernetes version? Two main approaches: in-place (stop the old version, start the new version on the same infra — causes brief downtime) or blue-green (stand up a new control plane, migrate traffic, tear down old one — zero downtime but more complex). Node upgrades typically use rolling replacement: drain a node, replace it with a new one running the new version.
API Server Endpoint
The API server must be reachable by kubelets, kubectl, and CI/CD systems. Options: public endpoint (accessible from the internet, protected by auth), private endpoint (only accessible within the cloud VPC), or both. Most managed services support both, with a toggle. The endpoint is typically served by a cloud load balancer in front of 3 API server replicas.
Typical Managed K8s Architecture
The Provisioning Flow
When a customer requests a new Kubernetes cluster through your managed service, here's what your platform needs to do:
Allocate Infrastructure
Provision the VMs (or containers) for the control plane: 3 API server instances, 3 etcd instances, 1+ scheduler and controller-manager instances. Allocate networking: a VPC (or subnet) for the control plane, a load balancer for the API server endpoint.
Generate Certificates
Create a Certificate Authority (CA) for the cluster. Generate TLS certificates for: the API server (signed for the LB hostname/IP), etcd peer and client certificates, the controller-manager and scheduler client certificates, the front-proxy certificates for aggregated API servers.
Bootstrap etcd
Start the etcd cluster with the generated certificates. Verify that all nodes have formed a healthy cluster and elected a leader. Configure backup schedules.
Start the Control Plane
Start the API server, pointing it at etcd. Start the controller-manager and scheduler, pointing them at the API server. Start the Cloud Controller Manager. Install CoreDNS and kube-proxy as cluster add-ons.
Configure Node Bootstrapping
Create bootstrap tokens (or configure a token signing key) that new worker nodes will use to authenticate with the API server during their initial join. Set up the TLS bootstrapping flow so kubelets can request signed certificates.
Generate kubeconfig
Generate the customer's kubeconfig file containing the API server endpoint, the CA certificate, and user credentials (typically an OIDC configuration or an exec-based credential plugin that integrates with your cloud's IAM).
Cloud Controller Manager Deep Dive
The CCM is the bridge between Kubernetes and your cloud. It's the component that makes type: LoadBalancer work, that labels nodes with zone information, and that cleans up when VMs are deleted.
Why CCM Exists
Originally, cloud-specific logic was embedded directly inside kube-controller-manager and kubelet. This was problematic: cloud providers had to fork the Kubernetes codebase or wait for upstream releases to include their changes. It also bloated the core with code that was irrelevant to non-cloud deployments.
The CCM was introduced to extract all cloud-dependent code into a separate, pluggable binary. Now, the core Kubernetes components are cloud-agnostic, and each cloud provider maintains their own CCM that implements a well-defined interface.
When you run Kubernetes with an external CCM:
- kube-controller-manager and the kubelet are configured with --cloud-provider=external, so in-tree cloud logic is disabled
- The API server remains cloud-agnostic and talks to the CCM through normal Kubernetes APIs
- The CCM runs as a Deployment or static pod (or standalone binary on control plane nodes)
- The CCM needs credentials to your cloud's API, though CSI drivers, CNIs, or autoscalers may also need cloud credentials depending on your design
The cloud.Interface
To implement a CCM, you implement the cloud.Interface from the k8s.io/cloud-provider package. The exact surface evolves across Kubernetes releases, so the snippet below is a simplified, partial sketch of the major concepts rather than a verbatim copy of the latest source:
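Under those caveats, a simplified sketch of the interface's shape looks like this. Method names and signatures are abridged (the real definition takes context parameters and richer Kubernetes types); the point is the structure: a top-level interface handing out optional sub-interfaces, each paired with a bool indicating whether the cloud supports that capability.

```go
package main

import "fmt"

// Abridged stand-ins for the sub-interfaces a CCM implements.
type LoadBalancer interface {
	EnsureLoadBalancer(clusterName, serviceName string, nodeNames []string) (ingressAddr string, err error)
	EnsureLoadBalancerDeleted(clusterName, serviceName string) error
}

type InstancesV2 interface {
	InstanceExists(providerID string) (bool, error)
	InstanceMetadata(providerID string) (zone, region, instanceType string, err error)
}

type Routes interface {
	CreateRoute(nodeName, podCIDR string) error
	DeleteRoute(nodeName string) error
}

// Cloud is the top-level entry point, analogous in shape to
// cloud.Interface: each accessor returns (impl, supported).
type Cloud interface {
	ProviderName() string
	LoadBalancer() (LoadBalancer, bool)
	InstancesV2() (InstancesV2, bool)
	Routes() (Routes, bool)
}

// A stub provider that supports instance metadata but declines
// load balancers and routes.
type myCloud struct{}

func (myCloud) ProviderName() string               { return "mycloud" }
func (myCloud) LoadBalancer() (LoadBalancer, bool) { return nil, false }
func (myCloud) Routes() (Routes, bool)             { return nil, false }
func (myCloud) InstancesV2() (InstancesV2, bool)   { return stubInstances{}, true }

type stubInstances struct{}

func (stubInstances) InstanceExists(providerID string) (bool, error) { return true, nil }
func (stubInstances) InstanceMetadata(providerID string) (string, string, string, error) {
	return "zone-a", "region-1", "m1.large", nil
}

func main() {
	var c Cloud = myCloud{}
	if _, ok := c.LoadBalancer(); !ok {
		// The CCM skips the service controller for this provider.
		fmt.Println(c.ProviderName(), "does not support load balancers")
	}
}
```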
Each sub-interface (LoadBalancer, InstancesV2, Zones, Routes) defines the methods the CCM calls when it needs to interact with your cloud. Let's look at each one.
The Four Controller Loops
Node Controller
The node controller is responsible for keeping Node objects in sync with your cloud's actual VM inventory. It has two main jobs:
1. Node initialization: When a new worker node joins the cluster, the kubelet creates a Node object, but it's not fully populated — it doesn't know its cloud-specific addresses, availability zone, or instance type. The node controller detects uninitialized nodes (they have a node.cloudprovider.kubernetes.io/uninitialized taint), calls your InstancesV2.InstanceMetadata() method to get metadata from your cloud API, and updates the Node object with:
- Addresses: internal IP, external IP, hostname
- Labels: topology.kubernetes.io/zone, topology.kubernetes.io/region, node.kubernetes.io/instance-type
- Provider ID: A unique identifier for the VM in your cloud (e.g., mycloud://region/zone/instance-id)
After initialization, the taint is removed, and the scheduler can place workloads on the node.
2. Node cleanup: The node controller periodically checks whether each Node's backing VM still exists in the cloud. If the VM has been deleted (e.g., user terminated it, spot instance reclaimed), the controller deletes the Node object from Kubernetes immediately. This is faster than waiting for heartbeat timeouts.
Service Controller (Load Balancers)
The service controller watches for Services of type: LoadBalancer and provisions/deprovisions cloud load balancers accordingly. This is probably the most complex part of a CCM to implement correctly.
When a LoadBalancer Service is created, the service controller calls your LoadBalancer.EnsureLoadBalancer() method. Your implementation must:
- Create a cloud load balancer (or update an existing one)
- Configure listeners on the ports specified in the Service
- Set up health checks and backend targets appropriate for your provider (often the Service's NodePort on each worker node, but not always)
- Configure security groups / firewall rules to allow traffic
- Return the load balancer's external IP or hostname
When the Service is updated (e.g., ports change, annotations change), EnsureLoadBalancer() is called again — your code must handle updates idempotently. When the Service is deleted, EnsureLoadBalancerDeleted() is called, and you must clean up all cloud resources.
Route Controller
The route controller configures cloud routes so that Pods on different nodes can communicate. In some networking setups, each node has a Pod CIDR (e.g., node1 has 10.244.1.0/24, node2 has 10.244.2.0/24). The cloud's VPC routing table needs to know: "packets destined for 10.244.1.0/24 should go to node1's VM."
The route controller watches Node objects and calls your Routes.CreateRoute() for each node, ensuring the VPC routing table is up to date. When nodes are removed, it calls DeleteRoute().
Note: If you use a CNI plugin that handles its own routing (e.g., Calico with BGP, or a VPC CNI that assigns VPC IPs directly to Pods), you may not need the route controller at all.
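The heart of the route controller is a set diff between the routes that should exist (one per node's Pod CIDR) and the routes the VPC currently has. A minimal sketch, using hypothetical map-based inputs in place of real cloud API calls:

```go
package main

import "fmt"

// reconcileRoutes computes which VPC routes to create and which to
// delete so the routing table matches the cluster's nodes.
// desired maps node name -> Pod CIDR (from Node objects);
// actual is what the cloud's routing table currently contains.
func reconcileRoutes(desired, actual map[string]string) (create, remove map[string]string) {
	create, remove = map[string]string{}, map[string]string{}
	for node, cidr := range desired {
		if actual[node] != cidr {
			create[node] = cidr // missing or stale: (re)create the route
		}
	}
	for node, cidr := range actual {
		if _, ok := desired[node]; !ok {
			remove[node] = cidr // node is gone: clean up its route
		}
	}
	return create, remove
}

func main() {
	desired := map[string]string{"node1": "10.244.1.0/24", "node2": "10.244.2.0/24"}
	actual := map[string]string{"node1": "10.244.1.0/24", "node3": "10.244.3.0/24"}
	create, remove := reconcileRoutes(desired, actual)
	// node2's route must be created; node3's must be deleted.
	fmt.Println(create, remove)
}
```

In a real CCM the create/remove sets become CreateRoute() and DeleteRoute() calls against your cloud API.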
How CCM Provisions a Load Balancer
Building Your CCM: Practical Guide
Here's a conceptual CCM skeleton for your cloud. The actual method set, imports, and command wiring vary a bit across Kubernetes releases, so treat this as pseudocode that shows the shape of the integration rather than copy-paste-ready source.
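One sketch of the most important property such a skeleton must have: idempotent load balancer provisioning. The cloudAPI type below is a hypothetical stand-in for your provider's SDK (real code would also reconcile listeners, health checks, and firewall rules); the key idea is deterministic naming, so a retry converges instead of duplicating.

```go
package main

import "fmt"

type loadBalancer struct {
	Name  string
	Ports []int
	IP    string
}

// cloudAPI is a fake in-memory cloud; real code calls your SDK.
type cloudAPI struct {
	lbs map[string]*loadBalancer // keyed by a deterministic name
}

// ensureLoadBalancer creates the LB if absent, otherwise converges
// the existing one toward the desired spec. Deriving the name from
// something stable (e.g., the Service UID) is what makes a retry or
// CCM restart safe: the second call finds the first call's LB.
func (c *cloudAPI) ensureLoadBalancer(name string, ports []int) *loadBalancer {
	lb, ok := c.lbs[name]
	if !ok {
		lb = &loadBalancer{Name: name, IP: "203.0.113.10"} // allocated by the cloud
		c.lbs[name] = lb
	}
	lb.Ports = ports // update path: converge, never duplicate
	return lb
}

func main() {
	api := &cloudAPI{lbs: map[string]*loadBalancer{}}
	api.ensureLoadBalancer("k8s-svc-abc123", []int{80})
	// Called again after a Service update (or CCM restart):
	lb := api.ensureLoadBalancer("k8s-svc-abc123", []int{80, 443})
	fmt.Println(len(api.lbs), lb.Ports) // still exactly one LB
}
```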
Rate limiting: Cloud API calls can be slow and rate-limited. Your CCM will be making API calls for every node health check, every LB operation, every route update. Implement caching and rate limiting, or you'll hit your cloud's API limits.
Idempotency: Every method must be idempotent. EnsureLoadBalancer() might be called multiple times for the same Service (e.g., on CCM restart). It must create the LB if it doesn't exist and update it if it does, without creating duplicates.
Error handling: Transient cloud API errors should not cause the controller to give up. The reconciliation loop will retry, but make sure you're returning appropriate errors so the item is re-queued.
Cluster Bootstrapping
How a Kubernetes cluster goes from nothing to a running system. Understanding this process deeply is essential for building a managed service that provisions clusters reliably.
The Certificate Hierarchy
Kubernetes security relies heavily on TLS certificates and, in many core control-plane paths, mutual TLS (mTLS). But not every interaction uses client certificates: Kubernetes also uses bootstrap tokens, service account tokens, OIDC, and other auth mechanisms. You still need a robust PKI (Public Key Infrastructure) before you can start the core control plane safely.
Certificates You Need to Generate
| Certificate | Used By | Purpose |
|---|---|---|
| Cluster CA key + cert | Everything | Root of trust. All certs are signed by this CA. The CA cert is distributed to all components so they can verify each other. |
| API server serving cert | API server | TLS server cert. Must include SANs for: the API server DNS name, the ClusterIP of the kubernetes Service, the load balancer IP/hostname, and localhost. |
| API server kubelet client cert | API server | Used when the API server connects to kubelet (e.g., for kubectl logs, kubectl exec). Must be in the system:masters group. |
| Controller manager client cert | Controller manager | Client cert to authenticate to the API server. Grants the controller manager identity. |
| Scheduler client cert | Scheduler | Client cert to authenticate to the API server. |
| etcd CA + peer certs | etcd | Separate CA for etcd. Peer certs for node-to-node communication. Server certs for serving client connections. |
| API server etcd client cert | API server | Client cert signed by etcd CA, used by API server to connect to etcd. |
| Front proxy CA + client cert | API server | Used for aggregated API servers (e.g., metrics-server). The API server uses the front proxy client cert when forwarding requests to extension API servers. |
| Kubelet client certs | Kubelet (per node) | Each kubelet gets a unique client cert, typically via TLS bootstrapping (see below). |
| Service account signing key | Controller manager, API server | An RSA key pair used to sign and verify service account tokens (JWTs). |
The Bootstrap Process
Here's the complete sequence for standing up a Kubernetes cluster from scratch. For a managed service, you'll automate every step of this.
Generate the PKI
Generate the root CA, etcd CA, front proxy CA, and all certificates listed above. Use a tool like cfssl, OpenSSL, or your cloud's KMS/certificate service. Store the CA private keys securely — they're the keys to the kingdom.
Start etcd
Start the etcd cluster on 3 (or 5) nodes. Each node needs: the etcd binary, its peer certificate, its server certificate, and the etcd CA certificate. The initial cluster configuration specifies all members. Wait for the cluster to form, elect a leader, and become healthy. Verify with etcdctl endpoint health.
Start the API Server
Start kube-apiserver with the following critical flags: --etcd-servers (etcd endpoints), --service-cluster-ip-range (e.g., 10.96.0.0/16), --tls-cert-file and --tls-private-key-file (serving cert), --client-ca-file (cluster CA for client auth), --etcd-certfile and --etcd-keyfile (etcd client cert), --service-account-key-file, --service-account-signing-key-file, and --service-account-issuer. The API server is now running, but the cluster is empty: no scheduler, no controllers, and no nodes.
Start Controller Manager and Scheduler
Start kube-controller-manager with: --kubeconfig (pointing at the API server with the controller-manager client cert), --cluster-signing-cert-file and --cluster-signing-key-file (the CA cert/key for signing kubelet certificate requests), --cloud-provider=external (since you're using a CCM). Start kube-scheduler similarly with its kubeconfig. Start the CCM with its kubeconfig and cloud credentials.
Install Cluster Add-ons
Use kubectl apply to install essential add-ons: CoreDNS (for service discovery DNS), kube-proxy (for Service routing — often as a DaemonSet), the CNI plugin (for pod networking). These run as Pods inside the cluster.
TLS Bootstrapping for Nodes
Create a bootstrap token (or configure a token signing key). When new worker nodes start, their kubelet uses this bootstrap token to authenticate to the API server and submit a Certificate Signing Request (CSR). An approver approves eligible kubelet CSRs (if auto-approval is enabled), and the controller-manager's signing controller signs the approved request and issues a client certificate. The kubelet switches from the bootstrap token to its new certificate. This is how nodes can join the cluster without pre-provisioned certificates.
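Bootstrap tokens have a fixed public format: <id>.<secret>, a 6-character token ID plus a 16-character secret, both lowercase alphanumeric. A sketch of generating one in Go (the newBootstrapToken helper is illustrative; kubeadm and the cluster's token controller handle storage and expiry):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const tokenChars = "abcdefghijklmnopqrstuvwxyz0123456789"

// randString returns n characters drawn uniformly from tokenChars
// using a cryptographic RNG.
func randString(n int) (string, error) {
	b := make([]byte, n)
	for i := range b {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(tokenChars))))
		if err != nil {
			return "", err
		}
		b[i] = tokenChars[idx.Int64()]
	}
	return string(b), nil
}

// newBootstrapToken produces an "id.secret" token. The ID is public
// (it names the token's Secret in kube-system); the secret part is
// what the joining kubelet must present.
func newBootstrapToken() (string, error) {
	id, err := randString(6)
	if err != nil {
		return "", err
	}
	secret, err := randString(16)
	if err != nil {
		return "", err
	}
	return id + "." + secret, nil
}

func main() {
	token, _ := newBootstrapToken()
	fmt.Println(token)
	// Your provisioning API would inject this token into the node's
	// cloud-init user data for the kubelet's initial join.
}
```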
Node Joins the Cluster
The kubelet starts, uses its new certificate to connect to the API server, and registers a Node object. The CCM's node controller sees the new uninitialized node, fetches metadata from your cloud API, and initializes it (adds addresses, labels, removes the taint). The scheduler can now place Pods on this node. The cluster is ready.
kubeadm: The Reference Implementation
kubeadm is the official tool for bootstrapping Kubernetes clusters. While you'll likely build your own provisioning system for a managed service, understanding kubeadm's flow is valuable because it implements the same bootstrap process described above.
- kubeadm init: Generates certificates, starts the control plane as static pods (manifests in /etc/kubernetes/manifests/ that the kubelet auto-starts), installs CoreDNS, and outputs a kubeadm join command with a bootstrap token.
- kubeadm join: Uses the bootstrap token to connect to the API server, validates the CA via a discovery token hash, requests a client certificate via TLS bootstrapping, and starts the kubelet.
For a managed service, you'll replace kubeadm with your own automation — but the underlying certificate generation, etcd bootstrapping, and component startup sequence is the same.
In a managed service, the control plane and worker nodes are in separate failure domains. The control plane runs in your cloud provider infrastructure (possibly a different VPC). Worker nodes run in the customer's infrastructure. The bootstrap token or join mechanism must securely bridge this gap. You'll likely implement a custom join flow where your provisioning API generates a short-lived token, injects it into the node's startup script (via cloud-init or user data), and the node uses it to join the correct cluster.