Kubernetes From First Principles
A deep, interactive guide to understanding how Kubernetes actually works — from the orchestration problem it solves, through every moving part, all the way to building your own managed Kubernetes service and Cloud Controller Manager.
What Kubernetes Solves
The container orchestration problem: you've containerized your application. Now you need to run it reliably across a fleet of machines. That's harder than it sounds.
The Problem: Containers Alone Aren't Enough
Containers solved the packaging problem. With Docker, you can bundle your application and all its dependencies into a single, portable image. You can run that image on any machine that has a container runtime. But containers by themselves don't solve the operational problem.
Imagine you have a web application made up of 15 microservices, each running in a container. You have a fleet of 20 servers. Now answer these questions:
- Where does each container run? You need to decide which machine gets which container based on CPU, memory, disk, and GPU availability. This is the scheduling problem.
- What happens when a machine dies? The containers on that machine are gone. You need to detect the failure and restart those containers on healthy machines. This is the self-healing problem.
- How do containers find each other? Your frontend needs to talk to your API, which needs to talk to your database. Container IPs change every time they restart. This is the service discovery problem.
- How do you deploy a new version without downtime? You can't just stop everything and restart. You need to gradually roll out new containers while draining old ones. This is the rolling update problem.
- How do you scale up when traffic spikes? You need to spin up more container replicas and distribute traffic across them. This is the scaling problem.
- How do you manage configuration and secrets? Database passwords, API keys, feature flags — these need to be injected into containers without baking them into images. This is the configuration management problem.
- How do you handle persistent storage? Containers are ephemeral — their filesystems disappear when they stop. Databases and stateful services need storage that survives container restarts. This is the storage orchestration problem.
You could build custom scripts to solve each of these problems individually. But they interact in complex ways, and at scale, the combinatorial complexity becomes unmanageable. This is exactly the problem Kubernetes was designed to solve.
Before and After Kubernetes
Before:
- Manual placement of containers on servers via SSH
- Custom health-check scripts with cron jobs
- Hardcoded IPs and ports in config files
- Blue-green deploys with manual DNS switching
- SSH into machines to check container status
- Custom bash scripts for scaling up/down
- Different deployment process for every team
- Snowflake servers that drift from their intended state
After:
- Declare desired state in YAML; scheduler places containers automatically
- Built-in health checks with automatic restart and rescheduling
- DNS-based service discovery and load balancing
- Rolling updates with rollout status, revision history, and rollback when needed
- Unified API to inspect and manage all workloads
- Horizontal Pod Autoscaler scales based on metrics
- Standardized deployment model for every team
- Continuous reconciliation: actual state converges to desired state
A Brief History: From Borg to Kubernetes
Kubernetes didn't appear out of nowhere. It was born from over a decade of experience running containers at massive scale inside Google.
Borg is Google's internal cluster management system, in use since the mid-2000s. It manages hundreds of thousands of jobs across a fleet of machines spanning dozens of data centers. Every Google product you use — Search, Gmail, YouTube, Maps — runs on Borg. It solved scheduling, fault tolerance, service discovery, and resource management long before Docker made containers mainstream.
In 2014, Google decided to open-source a new system inspired by the lessons learned from Borg (and its successor, Omega). Three Google engineers — Joe Beda, Brendan Burns, and Craig McLuckie — started the Kubernetes project. The name comes from the Greek word for "helmsman" or "pilot."
The key insight behind making it open-source was strategic: Google believed that if the industry standardized on an orchestration platform, it would commoditize infrastructure and reduce the competitive advantage of AWS (which at the time was locking customers in with proprietary services). By giving away the orchestration layer, Google made it easier for workloads to move between clouds.
In 2015, Kubernetes 1.0 was released and Google donated the project to the newly formed Cloud Native Computing Foundation (CNCF). Since then, it has become the de facto standard for container orchestration, with every major cloud provider offering a managed Kubernetes service.
Kubernetes is fundamentally a declarative, desired-state system. You tell it what you want (e.g., "run 3 replicas of my web server"), and its job is to continuously make reality match your declaration. If a replica crashes, it creates a new one. If a node goes down, it reschedules pods to healthy nodes. This reconciliation-loop architecture is the single most important concept in Kubernetes.
Core Concepts
Most of Kubernetes is driven by API objects persisted through the API server and backed by etcd. Understanding those objects and how they relate to each other is the foundation for everything else.
The API Object Model
Kubernetes has a single, unified API. Most cluster resources — Pods, Services, network policies, storage volumes, and more — are represented as API objects. Containers themselves are runtime processes on nodes, described through Pod specs and reported through Pod status. Every object has four key parts:
- apiVersion — which API group and version this object belongs to (e.g., apps/v1, v1)
- kind — what type of object this is (e.g., Pod, Deployment, Service)
- metadata — identifying information: name, namespace, labels, annotations, UID
- spec — your desired state: what you want this object to look like
Most objects also have a status field, which is the actual current state as observed by Kubernetes. Controllers work to make the real world match spec, while status records what they currently observe.
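The four-part shape plus status can be sketched as a plain data structure. This is an illustrative Python sketch (the names "web" and the replica counts are made up), showing how the gap between spec and status is what controllers work to close:

```python
# A minimal API object sketch (hypothetical nginx-style Deployment) showing
# the user-supplied parts plus the system-populated status field.
deployment = {
    "apiVersion": "apps/v1",          # API group/version
    "kind": "Deployment",             # object type
    "metadata": {"name": "web", "namespace": "default",
                 "labels": {"app": "web"}},
    "spec": {"replicas": 3,           # desired state, written by the user
             "selector": {"matchLabels": {"app": "web"}}},
    "status": {"readyReplicas": 2},   # observed state, written by controllers
}

def drift(obj):
    """How far actual state lags desired state (replicas still missing)."""
    return obj["spec"]["replicas"] - obj["status"].get("readyReplicas", 0)

print(drift(deployment))  # 1: one replica short of desired
```

A controller's whole job is to drive this drift to zero and keep it there.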
The Essential Objects
Pod
The smallest deployable unit. A Pod wraps one or more containers that share the same network namespace (same IP address), the same storage volumes, and the same lifecycle. They're co-scheduled onto the same machine. Think of a Pod as a logical host — the containers inside it are like processes running on the same machine. Pods are ephemeral: they're created, they run, and when they die, they are not resurrected. Instead, a controller creates a replacement.
ReplicaSet
Ensures that a specified number of identical Pod replicas are running at any given time. If a Pod dies, the ReplicaSet controller notices (via the API server's watch mechanism) and creates a replacement. If there are too many pods, it terminates extras. You rarely create ReplicaSets directly — Deployments manage them for you — but understanding them is key because they're the mechanism that actually maintains your desired replica count.
Deployment
The most common way to run stateless applications. A Deployment manages ReplicaSets, which in turn manage Pods. The key value of a Deployment is declarative rolling updates: when you change the Pod template (e.g., new container image), the Deployment creates a new ReplicaSet, gradually scales it up, and scales the old ReplicaSet down. If something goes wrong, you can roll back to the previous ReplicaSet. The Deployment controller handles the rollout mechanics automatically; rollback is usually triggered explicitly or by higher-level automation.
Service
A stable network endpoint for accessing a set of Pods. Pods are ephemeral and their IPs change — a Service provides a fixed IP (ClusterIP) and DNS name that automatically routes traffic to healthy pods matching a label selector. Services come in several types: ClusterIP (internal only), NodePort (exposes on every node's IP), LoadBalancer (provisions a cloud load balancer), and ExternalName (DNS alias).
Namespace
A virtual partition within a cluster. Namespaces provide scope for names (two Pods can have the same name in different namespaces), a unit for access control (RBAC policies can be namespace-scoped), and a unit for resource quotas. Common namespaces: default, kube-system (for system components), kube-public. In multi-tenant environments, each team or environment often gets its own namespace.
ConfigMap & Secret
ConfigMaps hold non-sensitive configuration data as key-value pairs. Secrets hold sensitive data (passwords, tokens, certificates), stored base64-encoded (not encrypted by default — you need to enable encryption at rest). Both can be mounted as files inside a Pod or exposed as environment variables. This separates configuration from container images, so you can use the same image across dev, staging, and production.
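Since base64 is an encoding, not encryption, anyone who can read a Secret object can recover the plaintext. A short Python sketch makes the point:

```python
import base64

# Secret values are stored base64-encoded. Encoding is trivially reversible,
# which is why encryption at rest must be enabled separately.
password = "s3cr3t"
encoded = base64.b64encode(password.encode()).decode()  # what gets stored
decoded = base64.b64decode(encoded).decode()            # what the Pod sees
print(encoded, decoded)  # czNjcjN0 s3cr3t
```

This is why RBAC restrictions on who can read Secrets matter as much as etcd encryption at rest.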
StatefulSet
Like a Deployment, but for stateful applications (databases, message queues, distributed stores). StatefulSets provide stable, unique network identifiers (pod-0, pod-1, pod-2), stable persistent storage (each pod gets its own PersistentVolumeClaim that survives rescheduling), and ordered, graceful deployment and scaling. The pods are created in order (0, then 1, then 2) and terminated in reverse order.
DaemonSet
Ensures that a copy of a Pod runs on every node in the cluster (or a subset of nodes). Ideal for node-level infrastructure: log collectors (Fluentd), monitoring agents (Prometheus node-exporter), network plugins (CNI), storage plugins (CSI node driver). When a new node joins the cluster, the DaemonSet controller automatically schedules a Pod onto it.
Labels, Selectors, and the Loose-Coupling Model
Kubernetes objects are connected to each other not by direct references, but by labels and selectors. A label is a key-value pair attached to an object (e.g., app: web, tier: frontend, version: v2). A selector is a query that matches objects by their labels.
This is how a Service knows which Pods to route traffic to: it has a selector like app: web, and it automatically discovers all Pods with that label. This is how a ReplicaSet knows which Pods it owns. This is how network policies select which Pods they apply to. Labels are the glue of Kubernetes.
This design is intentional: it enables loose coupling. A Service doesn't need to know the names or IPs of individual Pods. It just says "give me everything labeled app: web." Pods can come and go, and the Service automatically adapts. This loose coupling is what makes Kubernetes resilient and flexible.
The ownership hierarchy in Kubernetes goes: Deployment → ReplicaSet → Pod. The Deployment controller creates/manages ReplicaSets. The ReplicaSet controller creates/manages Pods. Each level watches the level below and reconciles. This layered controller model is how Kubernetes implements complex behaviors (like rolling updates) from simple primitives.
The Deployment → ReplicaSet → Pod Relationship
Architecture Overview
A Kubernetes cluster is split into two planes: the control plane (the brain) and the data plane (the muscle). Understanding this split is essential.
The Two Planes
Every Kubernetes cluster has a control plane and a data plane. The control plane makes decisions about the cluster: scheduling, detecting failures, responding to events. The data plane runs the actual workloads: your application containers.
In a self-managed cluster, the control plane components typically run on dedicated control plane nodes (usually 3 for high availability). In a managed Kubernetes service (like EKS, GKE, or the one you're building), the cloud provider operates the control plane for you — but users still interact with it through the Kubernetes API endpoint while running workloads on the data plane nodes.
Control Plane Components at a Glance
| Component | What It Does | Talks To |
|---|---|---|
| API Server | The front door. Every interaction with the cluster goes through the API server's REST API. It authenticates, authorizes, validates, and persists API objects to etcd. It also serves as the pub-sub hub — controllers watch the API server for changes. | etcd (read/write), all other components (they all talk to API server) |
| etcd | Distributed key-value store that holds the entire cluster state. Every API object is serialized and stored here. Uses the Raft consensus algorithm for leader election and data replication across 3 or 5 nodes. | API Server only (nothing else should touch etcd directly) |
| Scheduler | Watches for newly created Pods with no assigned node. Evaluates each node's fitness (CPU, memory, affinity rules, taints) and picks the best one. Writes the node assignment back to the API server. | API Server (watch for unscheduled pods, write node binding) |
| Controller Manager | Runs the core control loops: ReplicaSet controller, Deployment controller, Node controller, Job controller, EndpointSlice controller, etc. Each loop watches for specific API objects and takes action to reconcile desired state with actual state. | API Server (watch objects, update status) |
| Cloud Controller Manager | Runs cloud-specific control loops: provisioning load balancers, managing node lifecycle (detecting deleted VMs), configuring routes. This is the component you'll be implementing for your cloud. | API Server (watch objects), Cloud APIs (provision resources) |
Data Plane Components at a Glance
| Component | What It Does | Talks To |
|---|---|---|
| Kubelet | The node agent. Runs on every worker node. Watches the API server for Pods assigned to its node. Manages the full pod lifecycle: pulls images, creates containers (via CRI), runs health checks, reports status back. Also handles volume mounting and secret injection. | API Server (watch pods, report status), Container Runtime (via CRI), CSI drivers (via CSI) |
| kube-proxy | Manages network rules on each node to implement Service abstraction. Watches for Service and EndpointSlice objects and programs iptables/IPVS rules so that traffic to a Service ClusterIP gets forwarded to a healthy backend Pod. | API Server (watch Services and EndpointSlices), Node's network stack (iptables/IPVS) |
| Container Runtime | The software that actually runs containers. Kubernetes talks to it via the Container Runtime Interface (CRI). Common runtimes: containerd, CRI-O. The runtime pulls images from registries and creates/manages the actual Linux containers (using runc or similar OCI runtime under the hood). | Kubelet (via CRI gRPC), OCI runtime (runc), Image registries |
The Reconciliation Loop: Kubernetes' Core Pattern
The single most important architectural pattern in Kubernetes is the reconciliation loop (also called the "control loop" or "observe-diff-act" loop). Every controller in Kubernetes follows this exact pattern:
- Observe: Watch the API server for the current state of the objects you care about (e.g., "how many Pods exist with label app=web?")
- Diff: Compare the current state against the desired state (e.g., "the Deployment spec says 3 replicas, but only 2 exist")
- Act: Take the minimum action needed to drive current state toward desired state (e.g., "create 1 new Pod")
- Repeat: Go back to step 1 and do it again. Continuously. Forever.
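The steps above can be sketched as a single function. This is a toy "run N replicas" reconciler, not real controller code; observe/act are stand-ins for API server calls:

```python
# One observe-diff-act pass: given desired replicas and the pods observed
# to exist, return the minimal actions needed to converge.
def reconcile(desired_replicas, current_pods):
    diff = desired_replicas - len(current_pods)
    if diff > 0:
        return [("create", i) for i in range(diff)]          # too few: create
    if diff < 0:
        return [("delete", p) for p in current_pods[diff:]]  # too many: delete
    return []                                                # converged: no-op

print(reconcile(3, ["web-a", "web-b"]))    # one pod short: create one
print(reconcile(3, ["a", "b", "c", "d"]))  # one pod over: delete one
```

Run in a loop forever, this tiny function is self-healing: whatever deleted or crashed a pod, the next pass recreates it.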
This pattern makes Kubernetes self-healing by design. If a Pod crashes, a node goes down, or someone manually deletes a resource, the relevant controller will detect the drift and correct it. There's no central orchestrator issuing commands — instead, many independent controllers each manage their own slice of the world, all converging toward the declared desired state.
When you build a Cloud Controller Manager, you are writing controllers that follow this exact same pattern. Your node controller watches Node objects and reconciles them against your cloud's VM inventory. Your service controller watches Services and reconciles them against your cloud's load balancers. Understanding this pattern deeply is the key to building a correct CCM.
API Server & etcd
The API server is the brain of Kubernetes — every interaction flows through it. etcd is the memory — it stores the entire cluster state. Together, they form the foundation.
The API Server: The Single Source of Truth
The kube-apiserver is the only component that talks directly to etcd. Every other component — the scheduler, controllers, kubelet, even kubectl — interacts with the cluster exclusively through the API server's REST API.
This design is intentional and important:
- Single point of serialization: All reads and writes to cluster state go through one gateway, making it possible to enforce authentication, authorization, validation, and admission control consistently.
- Watch mechanism: The API server implements an efficient event-streaming mechanism. Components can "watch" for changes to specific resource types and get notified in near-real-time when objects are created, updated, or deleted. This is how controllers know when to reconcile.
- Optimistic concurrency: Every object has a resourceVersion field. When you update an object, you must include the resourceVersion you read. If someone else modified the object in between, the update is rejected with a 409 Conflict. This prevents lost updates without locking.
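The optimistic-concurrency check can be modeled in a few lines. This is a toy in-memory store, not the API server's implementation, but the compare-and-bump logic is the same idea:

```python
class Conflict(Exception):
    """Stand-in for an HTTP 409 Conflict response."""

class Store:
    """Toy model of the API server's optimistic-concurrency check."""
    def __init__(self, obj):
        self.obj, self.rv = obj, 1

    def get(self):
        return dict(self.obj), self.rv

    def update(self, obj, rv):
        if rv != self.rv:                 # someone wrote in between: reject
            raise Conflict(f"stale resourceVersion {rv}, current {self.rv}")
        self.obj, self.rv = obj, self.rv + 1
        return self.rv

store = Store({"replicas": 3})
obj, rv = store.get()
store.update({"replicas": 4}, rv)         # succeeds, rv advances to 2
try:
    store.update({"replicas": 5}, rv)     # reuses the stale rv: rejected
except Conflict as e:
    print("409:", e)
```

On a 409, a well-behaved client re-reads the object, re-applies its change, and retries.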
Authentication & Authorization
Every request to the API server is first authenticated (who are you?) and then authorized (can you do this?). Kubernetes supports multiple authentication methods: X.509 client certificates, bearer tokens, OIDC tokens, webhook token review, and service account tokens (used by pods to talk to the API server).
Authorization is handled by RBAC (Role-Based Access Control) in most clusters. You define Roles (what operations are allowed on which resources) and bind them to users or service accounts with RoleBindings. RBAC is namespace-scoped (Roles and RoleBindings) or cluster-wide (ClusterRoles and ClusterRoleBindings).
Admission Controllers
After authentication and authorization, the request passes through admission controllers. These are plugins that can mutate (modify the request) or validate (accept/reject the request) API objects. There are two phases:
- Mutating admission: Can modify the object before it's persisted. Example: injecting default resource limits, adding sidecar containers (Istio does this), setting default storage classes.
- Validating admission: Can only accept or reject. Example: enforcing that all containers must have resource limits, rejecting pods that try to run as root.
Both phases support webhooks: you can run your own admission logic as an external HTTP server, and the API server will call it for every relevant request. This is extremely powerful for enforcing custom policies.
The Request Flow Through the API Server
A write request (e.g., from kubectl apply) travels through authentication, authorization, mutating admission, schema validation, and validating admission; only then is the object persisted to etcd.
etcd: The Cluster's Memory
etcd is a distributed, strongly consistent key-value store. It's the authoritative source of truth for the entire cluster state. Every API object — every Pod, Deployment, Service, Secret, ConfigMap, and CustomResourceDefinition — is stored in etcd as a key-value pair.
How Data is Stored
Objects are stored under a key hierarchy. For example, a Pod named "web-1" in namespace "production" is stored at: /registry/pods/production/web-1. The value is the serialized API object (Protocol Buffers by default, though JSON is also possible). The API server abstracts this away — clients never need to know etcd's key layout.
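The key layout and prefix queries can be sketched in Python. The exact layout shown is illustrative; clients only ever see the API server's REST paths, never these keys:

```python
def registry_key(resource, namespace, name):
    """Approximate etcd key layout used by the API server (illustrative)."""
    return f"/registry/{resource}/{namespace}/{name}"

# A toy key-value store standing in for etcd.
store = {
    registry_key("pods", "production", "web-1"): "<serialized Pod>",
    registry_key("pods", "production", "web-2"): "<serialized Pod>",
    registry_key("pods", "staging", "web-1"): "<serialized Pod>",
}

def list_prefix(prefix):
    """Prefix scans are how 'list all Pods in a namespace' is served."""
    return sorted(k for k in store if k.startswith(prefix))

print(list_prefix("/registry/pods/production/"))
```

Hierarchical keys make both namespace-scoped and cluster-wide listings a single prefix scan.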
Raft Consensus
etcd uses the Raft consensus algorithm to replicate data across its cluster members. Here's how it works in simplified terms:
- Leader election: One etcd node is the leader at any time. All writes go through the leader. If the leader fails, the remaining nodes hold an election to choose a new one. This election completes in milliseconds.
- Log replication: When the leader receives a write, it appends it to its log and sends it to followers. Once a majority (quorum) of nodes have acknowledged the write, it's considered committed.
- Quorum: With 3 nodes, quorum is 2 (can tolerate 1 failure). With 5 nodes, quorum is 3 (can tolerate 2 failures). This is why etcd clusters are always odd-numbered.
- Strong consistency: Any read from the leader (or a linearizable read from a follower) returns the latest committed data. There is no "eventual consistency" in etcd — if a write is acknowledged, subsequent reads will see it.
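The quorum arithmetic above explains the odd-numbered sizing. A tiny sketch:

```python
def quorum(n):
    """Majority needed to commit a write in an n-member etcd cluster."""
    return n // 2 + 1

def tolerated_failures(n):
    """Members that can fail while the cluster still commits writes."""
    return n - quorum(n)

for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# Note that 4 members tolerate the same single failure as 3, while adding a
# member whose vote is needed for quorum — which is why clusters stay odd.
```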
The Watch Mechanism
etcd provides an efficient watch API. Clients can open a long-lived gRPC stream and receive notifications whenever keys matching a prefix are modified. The API server uses this to implement its own watch mechanism: controllers watch the API server, and the API server watches etcd. This creates a near-real-time event notification system without polling.
Watches are also resumable: if the connection drops, the client can reconnect and specify a resourceVersion to resume from, so it doesn't miss any events.
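Resumability falls out of the fact that every event carries a resourceVersion. A toy sketch (the event list and replay function stand in for the real gRPC stream):

```python
# Events carry monotonically increasing resourceVersions; after a disconnect
# the client resumes from the last version it processed.
events = [(1, "ADDED", "web-1"), (2, "MODIFIED", "web-1"),
          (3, "ADDED", "web-2"), (4, "DELETED", "web-1")]

def watch(since_rv):
    """Replay every event newer than since_rv (stand-in for a watch stream)."""
    return [e for e in events if e[0] > since_rv]

first = watch(0)[:2]        # client processes two events, then disconnects
last_seen = first[-1][0]    # remembers resourceVersion 2
resumed = watch(last_seen)  # on reconnect: nothing missed, nothing repeated
print(resumed)              # [(3, 'ADDED', 'web-2'), (4, 'DELETED', 'web-1')]
```

In practice the server only retains a bounded event history, so a client resuming from a too-old resourceVersion must fall back to a full re-list.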
etcd is the most critical component in a Kubernetes cluster. If etcd data is lost, the entire cluster state is lost — Kubernetes literally doesn't know what Pods should be running, what Services exist, or what the desired state is. etcd backups are non-negotiable in production. For a managed Kubernetes service, your etcd backup and restore strategy is one of the most important design decisions you'll make.
Scheduler & Controllers
The scheduler decides where Pods run. The controller manager keeps reality in sync with your declarations. Together, they are the decision-making machinery of Kubernetes.
The Scheduler: How Pods Get Placed on Nodes
When a Pod is first created (e.g., by a ReplicaSet controller), it has no spec.nodeName set. The scheduler watches for these "unscheduled" Pods and assigns each one to a suitable node. The process has two phases:
Phase 1: Filtering (Predicates)
The scheduler eliminates nodes that cannot run the Pod. Each filter plugin returns "fits" or "doesn't fit." A node must pass all filters to remain a candidate. Common filters:
- NodeResourcesFit: Does the node have enough allocatable CPU and memory to satisfy the Pod's resource requests?
- NodeAffinity: Does the node match the Pod's nodeAffinity rules (e.g., "only run on nodes with GPU=true label")?
- TaintToleration: Does the Pod tolerate the node's taints? Taints are "repel rules" on nodes — a node tainted with gpu=true:NoSchedule will reject pods that don't explicitly tolerate that taint.
- PodTopologySpread: Does scheduling here violate the Pod's spread constraints (e.g., "spread evenly across availability zones")?
- NodePorts: If the Pod uses hostPort, is that port available on this node?
- VolumeBinding: Can the Pod's requested PersistentVolumeClaims be satisfied on this node?
Phase 2: Scoring (Priorities)
Among the nodes that passed filtering, the scheduler ranks them. Each scoring plugin gives each node a score from 0 to 100. The scores are weighted and summed. The node with the highest total score wins. Common scoring plugins and strategies:
- NodeResourcesFit: Can favor emptier nodes to spread load, or fuller nodes to bin-pack workloads and reduce the number of machines in use.
- InterPodAffinity: Scores based on whether co-located pods would satisfy affinity or violate anti-affinity rules.
- PodTopologySpread: Prefers placements that keep replicas balanced across zones, nodes, or other topology domains.
- ImageLocality: Prefers nodes that already have the required container images cached locally (avoids image pull latency).
- Preferred affinity and taint handling: Soft preferences such as preferred node affinity and tolerated taints can influence the final ranking without acting as hard filters.
The Controller Manager: Reconciliation at Scale
The kube-controller-manager is a single binary that runs dozens of independent control loops. Each loop is responsible for a specific type of API object and drives the cluster toward the desired state for that object type.
How Controllers Work Internally
Each controller uses a pattern called Informer + Work Queue:
- Informer: A client-side cache that watches the API server via a long-lived connection. When an object is created, updated, or deleted, the Informer receives the event and updates its local cache. It also calls registered event handlers.
- Event handler: On receiving an event (add/update/delete), the handler extracts the object's key (namespace/name) and adds it to a work queue. It does not do any work here — just queues the item.
- Work queue: A rate-limited, deduplicating queue. If multiple events arrive for the same object before the worker processes it, only one reconciliation runs. This prevents thundering herd problems.
- Worker: Pulls items from the work queue and calls the Reconcile() function. This function reads the object from the Informer cache, compares desired state with actual state, and takes action (create/update/delete sub-resources). If the reconciliation fails, the item is re-queued with exponential backoff.
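The deduplicating queue is the piece that prevents redundant reconciles. A minimal sketch (rate limiting omitted for brevity):

```python
from collections import deque

class WorkQueue:
    """Deduplicating queue: an item queued many times is processed once."""
    def __init__(self):
        self._order, self._pending = deque(), set()

    def add(self, key):
        if key not in self._pending:   # collapse duplicate events for a key
            self._pending.add(key)
            self._order.append(key)

    def get(self):
        key = self._order.popleft()
        self._pending.discard(key)     # key may now be re-queued
        return key

    def __len__(self):
        return len(self._order)

q = WorkQueue()
for event_key in ["default/web", "default/web", "default/api", "default/web"]:
    q.add(event_key)                   # three events for default/web arrive...
print(len(q), q.get(), q.get())       # ...but only one queued entry results
```

This works because Reconcile() reads the latest state from the Informer cache anyway; processing the key once covers all the events that were collapsed into it.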
Key Controllers in kube-controller-manager
| Controller | Watches | Does |
|---|---|---|
| Deployment | Deployment, ReplicaSet | Creates new ReplicaSets on spec change, manages rollout (scale up new, scale down old), supports pause/resume/rollback |
| ReplicaSet | ReplicaSet, Pod | Ensures the exact number of Pod replicas. Creates or deletes Pods as needed. |
| Node | Node | Monitors node health via heartbeats. Sets NodeConditions, applies node-lifecycle taints, and eventually evicts pods from unhealthy nodes after grace and toleration periods. |
| EndpointSlice | Service, Pod | Maintains the set of IP addresses for Pods backing each Service. Updates when Pods are added/removed/become (un)ready. |
| Job | Job, Pod | Manages batch workloads. Creates Pods to run tasks to completion. Tracks success/failure, supports parallelism and retries. |
| ServiceAccount | Namespace | Creates a "default" ServiceAccount in every new namespace. |
| Namespace | Namespace | Handles namespace deletion: deletes all objects in the namespace before finalizing. |
| PersistentVolume | PVC, PV | Binds PersistentVolumeClaims to available PersistentVolumes. In modern clusters, dynamic provisioning is commonly performed by external CSI provisioners. |
Leader Election
In a highly available (HA) setup, you run multiple replicas of the controller manager, but only one should be actively reconciling at a time to avoid conflicts. Kubernetes uses leader election via a Lease object stored through the API server: one instance holds the lease and acts as leader while the others stand by. If the leader dies, a standby instance acquires the lease within seconds. This same pattern applies to the scheduler and the CCM.
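The lease mechanics can be modeled in a few lines. This toy version ignores the resourceVersion-based compare-and-swap the real implementation relies on, and the instance names are invented:

```python
class Lease:
    """Toy lease: the holder must renew within `duration` or lose leadership."""
    def __init__(self, duration):
        self.holder, self.renewed, self.duration = None, 0.0, duration

    def try_acquire(self, candidate, now):
        expired = now - self.renewed > self.duration
        if self.holder is None or expired or self.holder == candidate:
            self.holder, self.renewed = candidate, now   # acquire or renew
            return True
        return False   # lease held by a live leader: stay on standby

lease = Lease(duration=15.0)
assert lease.try_acquire("ccm-0", now=0.0)       # ccm-0 becomes leader
assert not lease.try_acquire("ccm-1", now=5.0)   # standby is blocked
assert lease.try_acquire("ccm-1", now=20.0)      # lease expired: failover
print(lease.holder)                              # ccm-1
```

The real implementation guards the acquire/renew write with optimistic concurrency, so two standbys racing for an expired lease cannot both win.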
Kubelet & Container Runtime
The kubelet is the node-level agent that actually makes containers run. It bridges the gap between the control plane's desired state and the reality on each machine.
Kubelet: The Node Agent
The kubelet runs on every node in the cluster. It is responsible for:
- Pod lifecycle management: Watches the API server for Pods assigned to its node (via spec.nodeName). Creates, starts, stops, and restarts containers. Handles init containers, sidecar containers, and ephemeral containers.
- Health checking: Runs liveness, readiness, and startup probes. Restarts containers that fail liveness probes. Removes Pods from Service endpoints when readiness probes fail.
- Resource management: Enforces CPU and memory limits via cgroups. Evicts Pods when the node runs out of disk or memory (eviction thresholds are configurable).
- Volume management: Mounts ConfigMaps, Secrets, PersistentVolumeClaims, and projected volumes into Pod containers.
- Status reporting: Periodically updates the Node object's status (allocatable resources, conditions, addresses) and each Pod's status (phase, container states, IPs) in the API server.
- Node heartbeats: Sends periodic heartbeats (Lease objects) to the API server to signal that the node is alive. The node controller in the control plane uses these to detect node failures.
Pod Startup Sequence
When the kubelet picks up a new Pod, here's what happens in order:
Admit the Pod
The kubelet runs admission checks: are there enough resources? Does the Pod's security context match the node's policies? If admission fails, the Pod is rejected.
Create the Pod Sandbox
The kubelet asks the container runtime (via CRI) to create a "sandbox" — this is the shared network namespace for all containers in the Pod. The CNI plugin is invoked here to set up networking: allocate an IP, configure interfaces, set up routes.
Pull Container Images
For each container in the Pod spec, the kubelet instructs the runtime to pull the image (if not already cached). The imagePullPolicy determines whether to always pull, never pull, or pull only if not present.
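The pull decision reduces to a small function of the policy and the local image cache. A simplified sketch (the real kubelet also considers image digests and tag conventions):

```python
def should_pull(policy, image_cached):
    """Simplified kubelet pull decision for the three imagePullPolicy values."""
    if policy == "Always":
        return True                 # always go to the registry
    if policy == "Never":
        return False                # fail if the image is not already local
    return not image_cached         # "IfNotPresent": pull only on cache miss

print(should_pull("IfNotPresent", image_cached=True))   # False: use cache
print(should_pull("Always", image_cached=True))         # True: pull anyway
```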
Run Init Containers
Init containers run sequentially, in order, before any app containers start. Each must complete successfully before the next starts. They're used for setup tasks: database migrations, config file generation, waiting for dependencies.
Start App Containers
All regular containers in the Pod start in parallel. The kubelet calls the CRI to create and start each container. Volumes are mounted, environment variables are injected, and the container's entrypoint is executed.
Run Probes
The startup probe runs first (if configured), giving the container time to initialize. Once the startup probe passes, liveness and readiness probes begin running on their configured intervals.
CRI: Container Runtime Interface
Kubernetes doesn't run containers directly. It delegates to a container runtime via a gRPC interface called the Container Runtime Interface (CRI). This abstraction means Kubernetes can work with any CRI-compliant runtime.
The CRI has two main services: RuntimeService (create/start/stop/remove containers and sandboxes) and ImageService (pull/list/remove images). The kubelet calls these gRPC methods, and the runtime handles the low-level details of creating Linux namespaces, setting up cgroups, mounting filesystems, and executing the container process.
What Happens Under the Hood: Containers are Linux Processes
A container is not a VM. It's a regular Linux process that runs in isolation using two kernel features:
- Namespaces: Provide isolation. A container commonly gets its own PID namespace (can't see host processes), network namespace (own IP stack), mount namespace (own filesystem), IPC namespace, and UTS namespace (own hostname). User namespaces exist too, but they are optional and not universal in Kubernetes deployments.
- cgroups (Control Groups): Limit resources. A container's cgroup restricts how much CPU and memory the process can use, and can also participate in other kernel-level resource controls. This is the core mechanism behind Kubernetes resources.limits.
When you set resources.limits.memory: 256Mi in a Pod spec, the kubelet tells the container runtime to create a cgroup with a memory limit of 256 MiB. If the process exceeds this, the Linux kernel's OOM killer terminates it, and the kubelet restarts the container.
What Happens When a Node Dies
In a cloud environment, the Cloud Controller Manager's node controller plays a critical role here. When the CCM detects that a VM has been terminated by the cloud provider (e.g., the user deleted it, or a spot instance was reclaimed), it can immediately remove the Node object instead of waiting for heartbeat timeouts and toleration-based eviction delays. This can reduce recovery time from minutes down to seconds.
Networking & Services
Kubernetes networking is one of the most complex and important topics. Every Pod gets its own IP. Every Service gets a stable endpoint. Here's how it all works.
The Kubernetes Networking Model
Kubernetes imposes a few fundamental networking rules. These are non-negotiable — every network implementation must satisfy them:
- Every Pod gets a unique IP address. No two Pods in the cluster share an IP.
- All Pods can communicate with all other Pods without NAT. If Pod A has IP 10.244.1.5 and Pod B has IP 10.244.2.8, Pod A can reach Pod B by simply connecting to 10.244.2.8. No port mapping, no NAT translation. This massively simplifies application networking.
- All Nodes can communicate with all Pods (and vice versa) without NAT.
- The IP that a Pod sees for itself is the same IP that others see for it. No hidden NAT that could confuse applications.
These rules define what is needed, but not how to implement it. The implementation is delegated to CNI plugins.
CNI: Container Network Interface
CNI is a specification for configuring network interfaces in Linux containers. When a Pod is created, the kubelet calls the CNI plugin to:
- Create a virtual ethernet (veth) pair: one end in the Pod's network namespace, one end on the host
- Assign an IP address to the Pod from the node's Pod CIDR range
- Set up routes so traffic to/from the Pod is properly forwarded
- Configure any network policies (firewall rules)
Popular CNI plugins include Calico (BGP-based, supports network policies), Cilium (eBPF-based, high performance), Flannel (simple overlay network), and AWS VPC CNI (assigns real VPC IPs to Pods). For your managed K8s service, your choice of CNI plugin (or building your own) is a critical architectural decision — it determines how Pod IPs relate to your cloud's VPC networking.
Services: Stable Endpoints for Pods
Pods are ephemeral — they get created and destroyed constantly, and their IPs change each time. A Service provides a stable, load-balanced endpoint that always routes to the right Pods.
Service Types
| Type | How It Works | Use Case |
|---|---|---|
| ClusterIP | Assigns a virtual IP from the Service CIDR range (e.g., 10.96.0.0/16). This IP is only reachable from within the cluster. kube-proxy programs iptables/IPVS rules on every node to NAT traffic from the ClusterIP to a backend Pod IP. | Internal communication between microservices |
| NodePort | Builds on ClusterIP. Additionally opens a static port (30000-32767) on every node's IP. External traffic to <nodeIP>:<nodePort> is forwarded to the ClusterIP, then to a backend Pod. Every node listens on this port, regardless of whether it runs a backend Pod. | Simple external access without a cloud load balancer |
| LoadBalancer | Often builds on NodePort, but not always. It provisions an external load balancer from the cloud provider (this is where your CCM comes in). Some implementations route traffic to node ports, while others target endpoints or Pod IPs more directly. The Service's status.loadBalancer.ingress is updated with the LB's external IP or hostname. | Production external access with cloud-native load balancing |
| ExternalName | Does not create a ClusterIP or proxy rules. Instead, creates a DNS CNAME record that maps the Service name to an external hostname (e.g., my-db.rds.amazonaws.com). | Referencing external services by a stable internal DNS name |
kube-proxy: The Network Plumber
kube-proxy runs on every node and implements the Service abstraction. It watches the API server for Service and EndpointSlice objects and programs the node's network stack accordingly. It commonly operates in iptables or IPVS modes, though some environments use newer dataplanes or replace kube-proxy entirely:
- iptables mode (default): Creates iptables rules in the PREROUTING and OUTPUT chains. When a packet is destined for a ClusterIP, iptables performs DNAT (destination NAT) to rewrite the destination IP to a randomly selected backend Pod IP. This happens entirely in the kernel — no userspace proxy involved.
- IPVS mode: Uses Linux IPVS (IP Virtual Server) instead of iptables. IPVS is a transport-layer load balancer built into the kernel. It's more efficient for clusters with thousands of Services because it uses hash tables instead of sequential iptables rules. Supports multiple load balancing algorithms: round-robin, least connections, source hash, etc.
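The "randomly selected backend" in iptables mode uses the iptables statistic module: for n endpoints, rule i (0-indexed) fires with probability 1/(n-i), and the last rule is unconditional, which works out to a uniform 1/n chance per backend. A small Go sketch of that probability chain:

```go
package main

import "fmt"

// statisticProbabilities returns the per-rule match probabilities
// kube-proxy's iptables mode programs for n endpoints: rule i fires
// with probability 1/(n-i), and the final rule always matches.
// Combined, every backend is selected with equal probability 1/n.
func statisticProbabilities(n int) []float64 {
	probs := make([]float64, n)
	for i := 0; i < n; i++ {
		probs[i] = 1.0 / float64(n-i)
	}
	return probs
}

func main() {
	// Three backends: rules fire with p = 1/3, 1/2, 1.
	// Overall selection: 1/3, (2/3)*(1/2) = 1/3, (2/3)*(1/2) = 1/3.
	fmt.Println(statisticProbabilities(3))
}
```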
How a Service Routes Traffic
DNS: Service Discovery
Kubernetes runs an internal DNS server (CoreDNS) as a Deployment in the kube-system namespace. Every Service automatically gets a DNS record:
- <service-name>.<namespace>.svc.cluster.local → ClusterIP (A record)
- For headless Services (no ClusterIP): DNS returns the individual Pod IPs directly
- For ExternalName Services: DNS returns a CNAME to the external hostname
Every Pod's /etc/resolv.conf is configured to use CoreDNS as its nameserver, with search domains so you can refer to services in the same namespace by just their name (e.g., curl http://web/).
Managed Kubernetes Architecture
How cloud providers like AWS, Google, and Azure run Kubernetes as a service — and the architectural decisions you'll face building your own.
What "Managed" Means
In a managed Kubernetes service, the cloud provider runs the control plane (API server, etcd, scheduler, controller manager, CCM) on behalf of the customer. The customer still uses that control plane through the Kubernetes API, but usually only manages the worker nodes (or in serverless-style offerings like EKS Fargate or GKE Autopilot, not even that).
The value proposition is clear: customers get a production-grade Kubernetes API without having to operate etcd, handle control plane upgrades, manage certificates, or worry about API server availability. They interact with Kubernetes the same way — kubectl apply works identically — but the operational burden of the control plane is entirely on the cloud provider.
The Key Architectural Decisions
Control Plane Isolation
- Dedicated: Each cluster gets its own control plane VMs (API server, etcd instances). Full isolation, but expensive. Services like EKS and GKE generally present a separate control plane per cluster, even though the provider's internal implementation details are abstracted.
- Shared: Multiple customers or clusters share control plane infrastructure (e.g., shared API servers with strong logical isolation, or virtual clusters). More cost-efficient, but requires strong multi-tenancy isolation.

For a new service, dedicated control planes are simpler and safer to start with.
etcd Topology
Where does etcd run? Options: (a) co-located with the API server on the same VM, (b) on separate dedicated VMs, (c) as a managed service. Separate etcd is better for isolation and performance. You also need to decide on cluster size (3 or 5 nodes), backup strategy (continuous snapshots to object storage), and restore procedures.
Network Architecture
How does the control plane communicate with worker nodes? Typically, the API server is exposed via a load balancer, and kubelets on worker nodes connect outbound to it. For private clusters, you need a VPN or private link between the control plane VPC and the customer's VPC. Your CNI plugin must integrate with your cloud's VPC networking.
Node Provisioning
How are worker nodes created? Typically, the managed service provides an API to create "node pools" (groups of VMs with the same configuration). The VMs are provisioned with a bootstrap script that installs kubelet, configures it with the API server endpoint and authentication token, and joins the cluster.
Upgrade Strategy
How do you upgrade the control plane Kubernetes version? Two main approaches: in-place (stop the old version, start the new version on the same infra — causes brief downtime) or blue-green (stand up a new control plane, migrate traffic, tear down old one — zero downtime but more complex). Node upgrades typically use rolling replacement: drain a node, replace it with a new one running the new version.
API Server Endpoint
The API server must be reachable by kubelets, kubectl, and CI/CD systems. Options: public endpoint (accessible from the internet, protected by auth), private endpoint (only accessible within the cloud VPC), or both. Most managed services support both, with a toggle. The endpoint is typically served by a cloud load balancer in front of 3 API server replicas.
Typical Managed K8s Architecture
The Provisioning Flow
When a customer requests a new Kubernetes cluster through your managed service, here's what your platform needs to do:
Allocate Infrastructure
Provision the VMs (or containers) for the control plane: 3 API server instances, 3 etcd instances, 1+ scheduler and controller-manager instances. Allocate networking: a VPC (or subnet) for the control plane, a load balancer for the API server endpoint.
Generate Certificates
Create a Certificate Authority (CA) for the cluster. Generate TLS certificates for: the API server (signed for the LB hostname/IP), etcd peer and client certificates, the controller-manager and scheduler client certificates, the front-proxy certificates for aggregated API servers.
Bootstrap etcd
Start the etcd cluster with the generated certificates. Verify that all nodes have formed a healthy cluster and elected a leader. Configure backup schedules.
Start the Control Plane
Start the API server, pointing it at etcd. Start the controller-manager and scheduler, pointing them at the API server. Start the Cloud Controller Manager. Install CoreDNS and kube-proxy as cluster add-ons.
Configure Node Bootstrapping
Create bootstrap tokens (or configure a token signing key) that new worker nodes will use to authenticate with the API server during their initial join. Set up the TLS bootstrapping flow so kubelets can request signed certificates.
Generate kubeconfig
Generate the customer's kubeconfig file containing the API server endpoint, the CA certificate, and user credentials (typically an OIDC configuration or an exec-based credential plugin that integrates with your cloud's IAM).
Cloud Controller Manager Deep Dive
The CCM is the bridge between Kubernetes and your cloud. It's the component that makes type: LoadBalancer work, that labels nodes with zone information, and that cleans up when VMs are deleted.
Why CCM Exists
Originally, cloud-specific logic was embedded directly inside kube-controller-manager and kubelet. This was problematic: cloud providers had to fork the Kubernetes codebase or wait for upstream releases to include their changes. It also bloated the core with code that was irrelevant to non-cloud deployments.
The CCM was introduced to extract all cloud-dependent code into a separate, pluggable binary. Now, the core Kubernetes components are cloud-agnostic, and each cloud provider maintains their own CCM that implements a well-defined interface.
When you run Kubernetes with an external CCM:
- kube-controller-manager and the kubelet are configured with --cloud-provider=external, so in-tree cloud logic is disabled
- The API server remains cloud-agnostic and talks to the CCM through normal Kubernetes APIs
- The CCM runs as a Deployment or static pod (or standalone binary on control plane nodes)
- The CCM needs credentials to your cloud's API, though CSI drivers, CNIs, or autoscalers may also need cloud credentials depending on your design
The cloud.Interface
To implement a CCM, you implement the cloud.Interface from the k8s.io/cloud-provider package. The exact surface evolves across Kubernetes releases, so the snippet below is a simplified, partial sketch of the major concepts rather than a verbatim copy of the latest source:
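Under those caveats, a simplified sketch of the interface's shape looks like this. Method names and signatures are abridged (the real definition takes context parameters and richer Kubernetes types); the point is the structure: a top-level interface handing out optional sub-interfaces, each paired with a bool indicating whether the cloud supports that capability.

```go
package main

import "fmt"

// Abridged stand-ins for the sub-interfaces a CCM implements.
type LoadBalancer interface {
	EnsureLoadBalancer(clusterName, serviceName string, nodeNames []string) (ingressAddr string, err error)
	EnsureLoadBalancerDeleted(clusterName, serviceName string) error
}

type InstancesV2 interface {
	InstanceExists(providerID string) (bool, error)
	InstanceMetadata(providerID string) (zone, region, instanceType string, err error)
}

type Routes interface {
	CreateRoute(nodeName, podCIDR string) error
	DeleteRoute(nodeName string) error
}

// Cloud is the top-level entry point, analogous in shape to
// cloud.Interface: each accessor returns (impl, supported).
type Cloud interface {
	ProviderName() string
	LoadBalancer() (LoadBalancer, bool)
	InstancesV2() (InstancesV2, bool)
	Routes() (Routes, bool)
}

// A stub provider that supports instance metadata but declines
// load balancers and routes.
type myCloud struct{}

func (myCloud) ProviderName() string               { return "mycloud" }
func (myCloud) LoadBalancer() (LoadBalancer, bool) { return nil, false }
func (myCloud) Routes() (Routes, bool)             { return nil, false }
func (myCloud) InstancesV2() (InstancesV2, bool)   { return stubInstances{}, true }

type stubInstances struct{}

func (stubInstances) InstanceExists(providerID string) (bool, error) { return true, nil }
func (stubInstances) InstanceMetadata(providerID string) (string, string, string, error) {
	return "zone-a", "region-1", "m1.large", nil
}

func main() {
	var c Cloud = myCloud{}
	if _, ok := c.LoadBalancer(); !ok {
		// The CCM skips the service controller for this provider.
		fmt.Println(c.ProviderName(), "does not support load balancers")
	}
}
```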
Each sub-interface (LoadBalancer, InstancesV2, Zones, Routes) defines the methods the CCM calls when it needs to interact with your cloud. Let's look at each one.
The Four Controller Loops
Node Controller
The node controller is responsible for keeping Node objects in sync with your cloud's actual VM inventory. It has two main jobs:
1. Node initialization: When a new worker node joins the cluster, the kubelet creates a Node object, but it's not fully populated — it doesn't know its cloud-specific addresses, availability zone, or instance type. The node controller detects uninitialized nodes (they have a node.cloudprovider.kubernetes.io/uninitialized taint), calls your InstancesV2.InstanceMetadata() method to get metadata from your cloud API, and updates the Node object with:
- Addresses: internal IP, external IP, hostname
- Labels: topology.kubernetes.io/zone, topology.kubernetes.io/region, node.kubernetes.io/instance-type
- Provider ID: A unique identifier for the VM in your cloud (e.g., mycloud://region/zone/instance-id)
After initialization, the taint is removed, and the scheduler can place workloads on the node.
2. Node cleanup: The node controller periodically checks whether each Node's backing VM still exists in the cloud. If the VM has been deleted (e.g., user terminated it, spot instance reclaimed), the controller deletes the Node object from Kubernetes immediately. This is faster than waiting for heartbeat timeouts.
Service Controller (Load Balancers)
The service controller watches for Services of type: LoadBalancer and provisions/deprovisions cloud load balancers accordingly. This is probably the most complex part of a CCM to implement correctly.
When a LoadBalancer Service is created, the service controller calls your LoadBalancer.EnsureLoadBalancer() method. Your implementation must:
- Create a cloud load balancer (or update an existing one)
- Configure listeners on the ports specified in the Service
- Set up health checks and backend targets appropriate for your provider (often the Service's NodePort on each worker node, but not always)
- Configure security groups / firewall rules to allow traffic
- Return the load balancer's external IP or hostname
When the Service is updated (e.g., ports change, annotations change), EnsureLoadBalancer() is called again — your code must handle updates idempotently. When the Service is deleted, EnsureLoadBalancerDeleted() is called, and you must clean up all cloud resources.
Route Controller
The route controller configures cloud routes so that Pods on different nodes can communicate. In some networking setups, each node has a Pod CIDR (e.g., node1 has 10.244.1.0/24, node2 has 10.244.2.0/24). The cloud's VPC routing table needs to know: "packets destined for 10.244.1.0/24 should go to node1's VM."
The route controller watches Node objects and calls your Routes.CreateRoute() for each node, ensuring the VPC routing table is up to date. When nodes are removed, it calls DeleteRoute().
Note: If you use a CNI plugin that handles its own routing (e.g., Calico with BGP, or a VPC CNI that assigns VPC IPs directly to Pods), you may not need the route controller at all.
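The heart of the route controller is a set diff between the routes that should exist (one per node's Pod CIDR) and the routes the VPC currently has. A minimal sketch, using hypothetical map-based inputs in place of real cloud API calls:

```go
package main

import "fmt"

// reconcileRoutes computes which VPC routes to create and which to
// delete so the routing table matches the cluster's nodes.
// desired maps node name -> Pod CIDR (from Node objects);
// actual is what the cloud's routing table currently contains.
func reconcileRoutes(desired, actual map[string]string) (create, remove map[string]string) {
	create, remove = map[string]string{}, map[string]string{}
	for node, cidr := range desired {
		if actual[node] != cidr {
			create[node] = cidr // missing or stale: (re)create the route
		}
	}
	for node, cidr := range actual {
		if _, ok := desired[node]; !ok {
			remove[node] = cidr // node is gone: clean up its route
		}
	}
	return create, remove
}

func main() {
	desired := map[string]string{"node1": "10.244.1.0/24", "node2": "10.244.2.0/24"}
	actual := map[string]string{"node1": "10.244.1.0/24", "node3": "10.244.3.0/24"}
	create, remove := reconcileRoutes(desired, actual)
	// node2's route must be created; node3's must be deleted.
	fmt.Println(create, remove)
}
```

In a real CCM the create/remove sets become CreateRoute() and DeleteRoute() calls against your cloud API.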
How CCM Provisions a Load Balancer
Building Your CCM: Practical Guide
Here's a conceptual CCM skeleton for your cloud. The actual method set, imports, and command wiring vary a bit across Kubernetes releases, so treat this as pseudocode that shows the shape of the integration rather than copy-paste-ready source.
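One sketch of the most important property such a skeleton must have: idempotent load balancer provisioning. The cloudAPI type below is a hypothetical stand-in for your provider's SDK (real code would also reconcile listeners, health checks, and firewall rules); the key idea is deterministic naming, so a retry converges instead of duplicating.

```go
package main

import "fmt"

type loadBalancer struct {
	Name  string
	Ports []int
	IP    string
}

// cloudAPI is a fake in-memory cloud; real code calls your SDK.
type cloudAPI struct {
	lbs map[string]*loadBalancer // keyed by a deterministic name
}

// ensureLoadBalancer creates the LB if absent, otherwise converges
// the existing one toward the desired spec. Deriving the name from
// something stable (e.g., the Service UID) is what makes a retry or
// CCM restart safe: the second call finds the first call's LB.
func (c *cloudAPI) ensureLoadBalancer(name string, ports []int) *loadBalancer {
	lb, ok := c.lbs[name]
	if !ok {
		lb = &loadBalancer{Name: name, IP: "203.0.113.10"} // allocated by the cloud
		c.lbs[name] = lb
	}
	lb.Ports = ports // update path: converge, never duplicate
	return lb
}

func main() {
	api := &cloudAPI{lbs: map[string]*loadBalancer{}}
	api.ensureLoadBalancer("k8s-svc-abc123", []int{80})
	// Called again after a Service update (or CCM restart):
	lb := api.ensureLoadBalancer("k8s-svc-abc123", []int{80, 443})
	fmt.Println(len(api.lbs), lb.Ports) // still exactly one LB
}
```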
Rate limiting: Cloud API calls can be slow and rate-limited. Your CCM will be making API calls for every node health check, every LB operation, every route update. Implement caching and rate limiting, or you'll hit your cloud's API limits.
Idempotency: Every method must be idempotent. EnsureLoadBalancer() might be called multiple times for the same Service (e.g., on CCM restart). It must create the LB if it doesn't exist and update it if it does, without creating duplicates.
Error handling: Transient cloud API errors should not cause the controller to give up. The reconciliation loop will retry, but make sure you're returning appropriate errors so the item is re-queued.
Cluster Bootstrapping
How a Kubernetes cluster goes from nothing to a running system. Understanding this process deeply is essential for building a managed service that provisions clusters reliably.
The Certificate Hierarchy
Kubernetes security relies heavily on TLS certificates and, in many core control-plane paths, mutual TLS (mTLS). But not every interaction uses client certificates: Kubernetes also uses bootstrap tokens, service account tokens, OIDC, and other auth mechanisms. You still need a robust PKI (Public Key Infrastructure) before you can start the core control plane safely.
Certificates You Need to Generate
| Certificate | Used By | Purpose |
|---|---|---|
| Cluster CA key + cert | Everything | Root of trust. All certs are signed by this CA. The CA cert is distributed to all components so they can verify each other. |
| API server serving cert | API server | TLS server cert. Must include SANs for: the API server DNS name, the ClusterIP of the kubernetes Service, the load balancer IP/hostname, and localhost. |
| API server kubelet client cert | API server | Used when the API server connects to kubelet (e.g., for kubectl logs, kubectl exec). Must be in the system:masters group. |
| Controller manager client cert | Controller manager | Client cert to authenticate to the API server. Grants the controller manager identity. |
| Scheduler client cert | Scheduler | Client cert to authenticate to the API server. |
| etcd CA + peer certs | etcd | Separate CA for etcd. Peer certs for node-to-node communication. Server certs for serving client connections. |
| API server etcd client cert | API server | Client cert signed by etcd CA, used by API server to connect to etcd. |
| Front proxy CA + client cert | API server | Used for aggregated API servers (e.g., metrics-server). The API server uses the front proxy client cert when forwarding requests to extension API servers. |
| Kubelet client certs | Kubelet (per node) | Each kubelet gets a unique client cert, typically via TLS bootstrapping (see below). |
| Service account signing key | Controller manager, API server | An RSA key pair used to sign and verify service account tokens (JWTs). |
The Bootstrap Process
Here's the complete sequence for standing up a Kubernetes cluster from scratch. For a managed service, you'll automate every step of this.
Generate the PKI
Generate the root CA, etcd CA, front proxy CA, and all certificates listed above. Use a tool like cfssl, OpenSSL, or your cloud's KMS/certificate service. Store the CA private keys securely — they're the keys to the kingdom.
Start etcd
Start the etcd cluster on 3 (or 5) nodes. Each node needs: the etcd binary, its peer certificate, its server certificate, and the etcd CA certificate. The initial cluster configuration specifies all members. Wait for the cluster to form, elect a leader, and become healthy. Verify with etcdctl endpoint health.
Start the API Server
Start kube-apiserver with the following critical flags: --etcd-servers (etcd endpoints), --service-cluster-ip-range (e.g., 10.96.0.0/16), --tls-cert-file and --tls-private-key-file (serving cert), --client-ca-file (cluster CA for client auth), --etcd-certfile and --etcd-keyfile (etcd client cert), --service-account-key-file, --service-account-signing-key-file, and --service-account-issuer. The API server is now running, but the cluster is empty: no scheduler, no controllers, and no nodes.
Start Controller Manager and Scheduler
Start kube-controller-manager with: --kubeconfig (pointing at the API server with the controller-manager client cert), --cluster-signing-cert-file and --cluster-signing-key-file (the CA cert/key for signing kubelet certificate requests), --cloud-provider=external (since you're using a CCM). Start kube-scheduler similarly with its kubeconfig. Start the CCM with its kubeconfig and cloud credentials.
Install Cluster Add-ons
Use kubectl apply to install essential add-ons: CoreDNS (for service discovery DNS), kube-proxy (for Service routing — often as a DaemonSet), the CNI plugin (for pod networking). These run as Pods inside the cluster.
TLS Bootstrapping for Nodes
Create a bootstrap token (or configure a token signing key). When new worker nodes start, their kubelet uses this bootstrap token to authenticate to the API server and submit a Certificate Signing Request (CSR). An approver approves eligible kubelet CSRs (if auto-approval is enabled), and the controller-manager's signing controller signs the approved request and issues a client certificate. The kubelet switches from the bootstrap token to its new certificate. This is how nodes can join the cluster without pre-provisioned certificates.
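Bootstrap tokens have a fixed public format: <id>.<secret>, a 6-character token ID plus a 16-character secret, both lowercase alphanumeric. A sketch of generating one in Go (the newBootstrapToken helper is illustrative; kubeadm and the cluster's token controller handle storage and expiry):

```go
package main

import (
	"crypto/rand"
	"fmt"
	"math/big"
)

const tokenChars = "abcdefghijklmnopqrstuvwxyz0123456789"

// randString returns n characters drawn uniformly from tokenChars
// using a cryptographic RNG.
func randString(n int) (string, error) {
	b := make([]byte, n)
	for i := range b {
		idx, err := rand.Int(rand.Reader, big.NewInt(int64(len(tokenChars))))
		if err != nil {
			return "", err
		}
		b[i] = tokenChars[idx.Int64()]
	}
	return string(b), nil
}

// newBootstrapToken produces an "id.secret" token. The ID is public
// (it names the token's Secret in kube-system); the secret part is
// what the joining kubelet must present.
func newBootstrapToken() (string, error) {
	id, err := randString(6)
	if err != nil {
		return "", err
	}
	secret, err := randString(16)
	if err != nil {
		return "", err
	}
	return id + "." + secret, nil
}

func main() {
	token, _ := newBootstrapToken()
	fmt.Println(token)
	// Your provisioning API would inject this token into the node's
	// cloud-init user data for the kubelet's initial join.
}
```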
Node Joins the Cluster
The kubelet starts, uses its new certificate to connect to the API server, and registers a Node object. The CCM's node controller sees the new uninitialized node, fetches metadata from your cloud API, and initializes it (adds addresses, labels, removes the taint). The scheduler can now place Pods on this node. The cluster is ready.
kubeadm: The Reference Implementation
kubeadm is the official tool for bootstrapping Kubernetes clusters. While you'll likely build your own provisioning system for a managed service, understanding kubeadm's flow is valuable because it implements the same bootstrap process described above.
- kubeadm init: Generates certificates, starts the control plane as static pods (manifests in /etc/kubernetes/manifests/ that the kubelet auto-starts), installs CoreDNS, and outputs a kubeadm join command with a bootstrap token.
- kubeadm join: Uses the bootstrap token to connect to the API server, validates the CA via a discovery token hash, requests a client certificate via TLS bootstrapping, and starts the kubelet.
For a managed service, you'll replace kubeadm with your own automation — but the underlying certificate generation, etcd bootstrapping, and component startup sequence is the same.
In a managed service, the control plane and worker nodes are in separate failure domains. The control plane runs in your cloud provider infrastructure (possibly a different VPC). Worker nodes run in the customer's infrastructure. The bootstrap token or join mechanism must securely bridge this gap. You'll likely implement a custom join flow where your provisioning API generates a short-lived token, injects it into the node's startup script (via cloud-init or user data), and the node uses it to join the correct cluster.