# Kubernetes

Open source container orchestration engine for containerized applications.<br/>
Hosted by the [Cloud Native Computing Foundation][cncf].

1. [Concepts](#concepts)
   1. [Control plane](#control-plane)
      1. [API server](#api-server)
      1. [`kube-scheduler`](#kube-scheduler)
      1. [`kube-controller-manager`](#kube-controller-manager)
      1. [`cloud-controller-manager`](#cloud-controller-manager)
   1. [Worker Nodes](#worker-nodes)
      1. [`kubelet`](#kubelet)
      1. [`kube-proxy`](#kube-proxy)
      1. [Container runtime](#container-runtime)
      1. [Addons](#addons)
   1. [Workloads](#workloads)
      1. [Pods](#pods)
1. [Best practices](#best-practices)
1. [Volumes](#volumes)
   1. [hostPaths](#hostpaths)
   1. [emptyDirs](#emptydirs)
   1. [configMaps](#configmaps)
   1. [secrets](#secrets)
   1. [nfs](#nfs)
   1. [downwardAPI](#downwardapi)
   1. [PersistentVolumes](#persistentvolumes)
      1. [Resize PersistentVolumes](#resize-persistentvolumes)
1. [Authorization](#authorization)
   1. [RBAC](#rbac)
1. [Autoscaling](#autoscaling)
   1. [Pod scaling](#pod-scaling)
      1. [Horizontal Pod Autoscaler](#horizontal-pod-autoscaler)
      1. [Vertical Pod Autoscaler](#vertical-pod-autoscaler)
   1. [Node scaling](#node-scaling)
1. [Scheduling](#scheduling)
   1. [Dedicate Nodes to specific workloads](#dedicate-nodes-to-specific-workloads)
   1. [Spread Pods on Nodes](#spread-pods-on-nodes)
1. [Quality of service](#quality-of-service)
1. [Containers with high privileges](#containers-with-high-privileges)
   1. [Capabilities](#capabilities)
   1. [Privileged container vs privilege escalation](#privileged-container-vs-privilege-escalation)
1. [Sysctl settings](#sysctl-settings)
1. [Backup and restore](#backup-and-restore)
1. [Managed Kubernetes Services](#managed-kubernetes-services)
   1. [Best practices in cloud environments](#best-practices-in-cloud-environments)
1. [Edge computing](#edge-computing)
1. [Troubleshooting](#troubleshooting)
   1. [Golang applications have trouble performing as expected](#golang-applications-have-trouble-performing-as-expected)
   1. [Recreate Pods upon ConfigMap's or Secret's content change](#recreate-pods-upon-configmaps-or-secrets-content-change)
   1. [Run a command in a Pod right after its initialization](#run-a-command-in-a-pod-right-after-its-initialization)
   1. [Run a command just before a Pod stops](#run-a-command-just-before-a-pod-stops)
1. [Examples](#examples)
   1. [Create an admission webhook](#create-an-admission-webhook)
1. [Further readings](#further-readings)
   1. [Sources](#sources)

## Concepts

When using Kubernetes, one is using a cluster.

Kubernetes clusters consist of one or more hosts (_Nodes_) executing containerized applications.<br/>
In cloud environments, Nodes are also available in grouped sets (_Node Pools_) capable of automatic scaling.

Nodes host application workloads in the form of [_Pods_][pods].

The [_control plane_](#control-plane) manages the cluster's Nodes and Pods.

### Control plane

Makes global decisions about the cluster (like scheduling).<br/>
Detects and responds to cluster events (like starting up a new Pod when a deployment has fewer replicas than it
requests).<br/>
Exposes the Kubernetes APIs and interfaces used to define, deploy, and manage the lifecycle of the cluster's
resources.

The control plane is composed of:

- The [API server](#api-server);
- The _distributed store_ for the cluster's configuration data.<br/>
  The current default store of choice is [`etcd`][etcd].
- The [scheduler](#kube-scheduler);
- The [cluster controller](#kube-controller-manager);
- The [cloud controller](#cloud-controller-manager).

Control plane components run on one or more cluster Nodes as Pods.<br/>
For ease of use, setup scripts typically start all control plane components on the **same** host and avoid running
other workloads on it.<br/>
In production environments, the control plane usually runs across multiple **dedicated** Nodes in order to provide
improved fault tolerance and high availability.

#### API server

Exposes the Kubernetes API. It is the front end for, and the core of, the Kubernetes control plane.<br/>
`kube-apiserver` is the main implementation of the Kubernetes API server, and is designed to scale horizontally (by
deploying more instances) and balance traffic between its instances.

The API server exposes the HTTP API that lets end users, different parts of a cluster and external components
communicate with one another, or query and manipulate the state of API objects in Kubernetes.<br/>
It can be accessed through command-line tools or directly using REST calls.<br/>
The serialized state of the objects is stored by writing them into `etcd`'s store.

The use of one of the available client libraries is suggested when writing applications that use the Kubernetes
API.<br/>
The complete API details are documented using OpenAPI.

Kubernetes supports multiple API versions, each at a different API path (e.g.: `/api/v1`,
`/apis/rbac.authorization.k8s.io/v1alpha1`).<br/>
All the different versions are representations of the same persisted data.<br/>
The server handles the conversion between API versions transparently.
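
A quick way to see this in practice (assuming a reachable cluster and standard `kubectl`):

```sh
# List every API version the server exposes.
kubectl api-versions

# Query an API path directly; different versions return representations of the same persisted objects.
kubectl get --raw '/api/v1/namespaces/default/pods'
```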

Versioning is done at the API level, rather than at the resource or field level, to ensure the API presents a clear
and consistent view of system resources and behavior.<br/>
It also enables controlling access to end-of-life and/or experimental APIs.

API groups can be enabled or disabled.<br/>
API resources are distinguished by their **API group**, **resource type**, **namespace** (for namespaced resources),
and **name**.<br/>
New API resources and new resource fields can be added frequently.<br/>
Elimination of resources or fields requires following the [API deprecation policy].

The Kubernetes API can be extended:

- using _custom resources_ to declaratively define how the API server should provide your chosen resource API, or
- extending the Kubernetes API by implementing an aggregation layer.

#### `kube-scheduler`

Detects newly created Pods with no assigned Node, and selects one for them to run on.

Scheduling decisions take into account:

- individual and collective resource requirements;
- hardware/software/policy constraints;
- Affinity and anti-Affinity specifications;
- data locality;
- inter-workload interference;
- deadlines.

#### `kube-controller-manager`

Runs _controller_ processes.<br/>
Each controller is logically a separate process; they are all compiled into a single binary and run in a single
process to reduce complexity.

Examples of these controllers are:

- the Node controller, which notices and responds when Nodes go down;
- the Replication controller, which maintains the correct number of Pods for every replication controller object in
  the system;
- the Job controller, which watches one-off task (_Job_) objects and creates Pods to run them to completion;
- the EndpointSlice controller, which populates _EndpointSlice_ objects providing a link between Services and Pods;
- the ServiceAccount controller, which creates default ServiceAccounts for new Namespaces.

#### `cloud-controller-manager`

Embeds cloud-specific control logic, linking clusters to one's cloud provider's API and separating the components
that interact with that cloud platform from the components that only interact with clusters.

Clusters only run controllers that are specific to one's cloud provider.<br/>
If running Kubernetes on one's own premises, or in a learning environment inside one's own PC, the cluster will have
no cloud controller managers.

As with the `kube-controller-manager`, cloud controller managers combine several logically independent control loops
into single binaries run as single processes.<br/>
They can scale horizontally to improve performance or to help tolerate failures.

The following controllers can have cloud provider dependencies:

- the Node controller, which checks the cloud provider to determine if a Node has been deleted in the cloud after it
  stops responding;
- the route controller, which sets up routes in the underlying cloud infrastructure;
- the service controller, which creates, updates and deletes cloud provider load balancers.

### Worker Nodes

Each and every Node runs components providing a runtime environment for the cluster, and syncing with the control
plane to keep workloads running as requested.

#### `kubelet`

A `kubelet` runs as an agent on each and every Node in the cluster, making sure that containers are run in a Pod.

It takes a set of _PodSpecs_ and ensures that the containers described in them are running and healthy.<br/>
It only manages containers created by Kubernetes.

#### `kube-proxy`

Network proxy running on each Node and implementing part of the Kubernetes Service concept.

It maintains all the network rules on Nodes which allow network communication to the Pods from network sessions
inside or outside of one's cluster.

It uses the operating system's packet filtering layer, if there is one and it's available; if not, it just forwards
the traffic itself.

#### Container runtime

The software responsible for running containers.

Kubernetes supports container runtimes like `containerd`, `CRI-O`, and any other implementation of the Kubernetes CRI
(Container Runtime Interface).

#### Addons

Addons use Kubernetes resources (_DaemonSet_, _Deployment_, etc.) to implement cluster features.<br/>
As such, namespaced resources for addons belong within the `kube-system` namespace.

See [addons] for an extended list of the available addons.

### Workloads

Workloads consist of groups of containers ([_Pods_][pods]) and a specification for how to run them (_Manifest_).<br/>
Manifest files are written in YAML (preferred) or JSON format and are composed of the following (a sketch follows the
list):

- metadata,
- resource specifications, with attributes specific to the kind of resource they are describing, and
- status, automatically generated and edited by the control plane.
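
A minimal sketch of that structure, assuming a hypothetical `hello` Deployment (the `status` section is shown for
illustration only, as the control plane generates and manages it):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:                # metadata
  name: hello
spec:                    # resource specification
  replicas: 2
  selector:
    matchLabels: { app: hello }
  template:
    metadata:
      labels: { app: hello }
    spec:
      containers:
        - name: hello
          image: nginx
status:                  # generated and edited by the control plane
  availableReplicas: 2
```
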
#### Pods

The smallest deployable unit of computing that one can create and manage in Kubernetes.<br/>
Pods contain one or more relatively tightly coupled application containers; they are always co-located (executed on
the same host) and co-scheduled (executed together), and **share** context, storage and network resources, and a
specification for how to run them.

Pods are (and _should be_) usually created through other workload resources (like _Deployments_, _StatefulSets_, or
_Jobs_) and **not** directly.<br/>
Such parent resources leverage and manage _ReplicaSets_, which in turn manage copies of the same Pod. When deleted,
**all** the resources they manage are deleted with them.

Gotchas:

- If a Container specifies a memory or CPU `limit` but does **not** specify a memory or CPU `request`, Kubernetes
  automatically assigns it a resource `request` spec equal to the given `limit`. See the sketch after this list.
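
A hedged illustration of that gotcha (container name, image, and values are assumptions):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: app
      image: nginx
      resources:
        limits:
          cpu: 500m      # no 'requests' section given…
          memory: 256Mi  # …so Kubernetes assigns requests equal to these limits
```
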
## Best practices

Also see [configuration best practices] and the [production best practices checklist].

- Prefer an **updated** version of Kubernetes.<br/>
  The upstream project maintains release branches for the most recent three minor releases.<br/>
  Kubernetes 1.19 and newer receive approximately 1 year of patch support. Kubernetes 1.18 and older received
  approximately 9 months of patch support.
- Prefer **stable** versions of Kubernetes for production clusters.
- Prefer using **multiple Nodes** for production clusters.
- Prefer **consistent** versions of Kubernetes components throughout **all** Nodes.<br/>
  Components support [version skew][version skew policy] up to a point, with specific tools placing additional
  restrictions.
- Consider keeping **separation of ownership and control** and/or grouping related resources.<br/>
  Leverage [Namespaces].
- Consider **organizing** cluster and workload resources.<br/>
  Leverage [Labels][labels and selectors]; see [recommended Labels].
- Consider forwarding logs to a central log management system for better storage and easier access.
- Avoid sending traffic to Pods which are not ready to manage it.<br/>
  [Readiness probes][configure liveness, readiness and startup probes] signal services to not forward requests until
  the probe verifies its own Pod is up.<br/>
  [Liveness probes][configure liveness, readiness and startup probes] ping the Pod for a response and check its
  health; if the check fails, they kill the current Pod and launch a new one.<br/>
  See the sketch after this list.
- Avoid workloads and Nodes failing due to limited resources being available.<br/>
  Set [resource requests and limits][resource management for pods and containers] to reserve a minimum amount of
  resources for Pods and limit their hogging abilities.
- Prefer smaller container images.
- Prioritize critical workloads.<br/>
  Leverage [quality of service](#quality-of-service).
- Instrument workloads to detect and respond to the `SIGTERM` signal to allow them to safely and cleanly shut down.
- Avoid using bare Pods.<br/>
  Prefer defining them as part of a replica-based resource, like Deployments, StatefulSets, ReplicaSets or DaemonSets.
- Leverage [autoscalers](#autoscaling).
- Try to avoid workload disruption.<br/>
  Leverage Pod disruption budgets.
- Try to use all available Nodes.<br/>
  Leverage affinities, taints and tolerations.
- Push for automation.<br/>
  [GitOps].
- Apply the principle of least privilege.<br/>
  Reduce container privileges where possible.<br/>
  Leverage Role-based access control (RBAC).
- Restrict traffic between objects in the cluster.<br/>
  See [network policies].
- Regularly audit events and logs, also for control plane components.
- Keep an eye on connection tables.<br/>
  Especially valid when using [connection tracking].
- Protect the cluster's ingress points.<br/>
  Firewalls, web application firewalls, application gateways.
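
A combined sketch of the probes and resource settings mentioned above (container name, image, endpoint, and values
are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: web                # hypothetical workload
      image: nginx
      readinessProbe:          # gate traffic until the Pod is ready
        httpGet: { path: /healthz, port: 80 }
        initialDelaySeconds: 5
      livenessProbe:           # kill and recreate the container if it stops responding
        httpGet: { path: /healthz, port: 80 }
        periodSeconds: 10
      resources:
        requests: { cpu: 100m, memory: 128Mi }  # reserved minimum
        limits: { cpu: 500m, memory: 256Mi }    # hogging cap
```
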
## Volumes

Refer [volumes].

Sources to mount directories from.

They go by the `volumes` key in Pods' `spec`.<br/>
E.g., in a Deployment they are declared in its `spec.template.spec.volumes`:

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      volumes:
        - <volume source 1>
        - <volume source N>
```

Mount volumes in containers by using the `volumeMounts` key:

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: some-container
      volumeMounts:
        - name: my-volume-source
          mountPath: /path/to/mount
          readOnly: false
          subPath: dir/in/volume
```

### hostPaths

Mount files or directories from the host Node's filesystem into Pods.

**Not** something most Pods will need, but powerful escape hatches for some applications.

Use cases:

- Containers needing access to Node-level system components.<br/>
  E.g., containers transferring system logs to a central location and needing access to those logs using a read-only
  mount of `/var/log`.
- Making configuration files stored on the host system available read-only to _static_ Pods.<br/>
  This is because static Pods **cannot** access ConfigMaps.

If mounted files or directories on the host are only accessible to `root`:

- Either the process needs to run as `root` in a privileged container,
- Or the files' permissions on the host need to be changed to allow the process to read from (or write to) the
  volume.

```yaml
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: example-volume
      # Mount '/data/foo' only if that directory already exists.
      hostPath:
        path: /data/foo  # location on host
        type: Directory  # optional
```

### emptyDirs

Scratch disks for **temporary** Pod data.

**Not** shared between Pods.<br/>
All data is **destroyed** once the Pod is removed, but stays intact when Pods restart.

Use cases:

- Provide directories to create pid/lock or other special files for 3rd-party software when it's inconvenient or
  impossible to disable them.<br/>
  E.g., Java Hazelcast creates lockfiles in the user's home directory and there's no way to disable this behaviour.
- Store intermediate calculations which can be lost.<br/>
  E.g., external sorting, buffering of big responses to save memory.
- Improve startup time after application crashes if the application in question pre-computes something before or
  during startup.<br/>
  E.g., compressed assets in the application's image, decompressing data into a temporary directory.

```yaml
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: my-empty-dir
      emptyDir:
        # Omit the 'medium' field to use disk storage.
        # The 'Memory' medium will create a tmpfs to store data.
        medium: Memory
        sizeLimit: 1Gi
```

### configMaps

Inject configuration data into Pods.

When referencing a ConfigMap:

- Provide the name of the ConfigMap in the volume.
- Optionally customize the path to use for a specific entry in the ConfigMap.

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: test
      volumeMounts:
        - name: config-vol
          mountPath: /etc/config
  volumes:
    - name: config-vol
      configMap:
        name: log-config
        items:
          - key: log_level
            path: log_level
    - name: my-configmap-volume
      configMap:
        name: my-configmap
        defaultMode: 0644  # POSIX access mode; set it to the most restricted value
        optional: true     # allow Pods to start with this ConfigMap missing, resulting in an empty directory
```

ConfigMaps **must** be created before they can be mounted.

One ConfigMap can be mounted into any number of Pods.

ConfigMaps are always mounted `readOnly`.

Containers using ConfigMaps as `subPath` volume mounts will **not** receive ConfigMap updates.

Text data is exposed as files using the UTF-8 character encoding.<br/>
Use `binaryData` for any other character encoding.

### secrets

Used to pass sensitive information to Pods.<br/>
E.g., passwords.

They behave like ConfigMaps, but are backed by `tmpfs` so they are never written to non-volatile storage.

Secrets **must** be created before they can be mounted.

Secrets are always mounted `readOnly`.

Containers using Secrets as `subPath` volume mounts will **not** receive Secret updates.

```yaml
apiVersion: v1
kind: Pod
spec:
  volumes:
    - name: my-secret-volume
      secret:
        secretName: my-secret
        defaultMode: 0644
        optional: false
```

### nfs

Mount **existing** NFS shares into Pods.

The contents of NFS volumes are preserved after Pods are removed, and the volume is merely unmounted.<br/>
This means that NFS volumes can be pre-populated with data, and that data can be shared between Pods.

NFS can be mounted by multiple writers simultaneously.

One **cannot** specify NFS mount options in a Pod spec.<br/>
Either set mount options server-side or use `/etc/nfsmount.conf`.<br/>
Alternatively, mount NFS volumes via PersistentVolumes, as those do allow setting mount options.

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - image: registry.k8s.io/test-web-server
      name: test-container
      volumeMounts:
        - mountPath: /my-nfs-data
          name: test-volume
  volumes:
    - name: test-volume
      nfs:
        server: my-nfs-server.example.com
        path: /my-nfs-volume
        readOnly: true
```

### downwardAPI

Downward APIs expose Pods' and containers' resource declaration or status field values.<br/>
Refer [Expose Pod information to Containers through files].

Downward API volumes make downward API data available to applications as read-only files in plain text format.

Containers using the downward API as `subPath` volume mounts will **not** receive updates when field values change.

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    cluster: test-cluster1
    rack: rack-22
    zone: us-east-coast
spec:
  volumes:
    - name: my-downward-api-volume
      downwardAPI:
        defaultMode: 0644
        items:
          - path: labels
            fieldRef:
              fieldPath: metadata.labels

# Mounting this volume results in a file with contents similar to the following:
# ```plaintext
# cluster="test-cluster1"
# rack="rack-22"
# zone="us-east-coast"
# ```
```

### PersistentVolumes

#### Resize PersistentVolumes

1. Check the `StorageClass` is set with `allowVolumeExpansion: true`:

   ```sh
   kubectl get storageClass 'storage-class-name' -o jsonpath='{.allowVolumeExpansion}'
   ```

1. Edit the PersistentVolumeClaim's `spec.resources.requests.storage` field.<br/>
   This will take care of the underlying PersistentVolume's size automagically.

   ```sh
   kubectl edit persistentVolumeClaim 'my-pvc'
   ```
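
   A non-interactive alternative sketch (the size value is an illustrative assumption):

   ```sh
   kubectl patch pvc 'my-pvc' -p '{"spec":{"resources":{"requests":{"storage":"20Gi"}}}}'
   ```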

1. Verify the change by checking the PVC's `status.capacity` field:

   ```sh
   kubectl get pvc 'my-pvc' -o jsonpath='{.status}'
   ```

Should one see the message

> Waiting for user to (re-)start a pod to finish file system resize of volume on node

under the `status.conditions` field, just wait some time.<br/>
It should **not** be necessary to restart the Pods, and the capacity should soon change to the requested one.

Gotchas:

- It's possible to recreate StatefulSets **without** the need of killing the Pods they control.<br/>
  Reapply the STS' declaration with a new PersistentVolume size, and start new Pods to resize the underlying
  filesystem.

<details>
<summary>If deploying the STS via Helm</summary>

1. Change the size of the PersistentVolumeClaims used by the STS:

   ```sh
   kubectl edit persistentVolumeClaims 'my-pvc'
   ```

1. Delete the STS **without killing its Pods**:

   ```sh
   kubectl delete statefulSets.apps 'my-sts' --cascade 'orphan'
   ```

1. Redeploy the STS with the changed size.<br/>
   It will retake ownership of existing Pods.

1. Delete the STS' Pods one-by-one.<br/>
   During Pod restart, the Kubelet will resize the filesystem to match the new block device size.

   ```sh
   kubectl delete pod 'my-sts-pod'
   ```

</details>
<details>
<summary>If managing the STS manually</summary>

1. Change the size of the PersistentVolumeClaims used by the STS:

   ```sh
   kubectl edit persistentVolumeClaims 'my-pvc'
   ```

1. Note down the names of PVs for specific PVCs and their sizes:

   ```sh
   kubectl get persistentVolume 'my-pv'
   ```

1. Dump the STS to disk:

   ```sh
   kubectl get sts 'my-sts' -o yaml > 'my-sts.yaml'
   ```

1. Remove any extra field (like `metadata.{selfLink,resourceVersion,creationTimestamp,generation,uid}` and `status`)
   and set the template's PVC size to the value you want.

1. Delete the STS **without killing its Pods**:

   ```sh
   kubectl delete sts 'my-sts' --cascade 'orphan'
   ```

1. Reapply the STS.<br/>
   It will retake ownership of existing Pods.

   ```sh
   kubectl apply -f 'my-sts.yaml'
   ```

1. Delete the STS' Pods one-by-one.<br/>
   During Pod restart, the Kubelet will resize the filesystem to match the new block device size.

   ```sh
   kubectl delete pod 'my-sts-pod'
   ```

</details>

## Authorization

### RBAC

Refer [Using RBAC Authorization].

_Role_s and _ClusterRole_s contain rules, each representing a set of permissions.<br/>
Permissions are purely additive - there are no _deny_ rules.

Roles are constrained to the namespace they are defined in.<br/>
ClusterRoles are **non**-namespaced resources, and are meant for cluster-wide roles.

<details style='padding: 0 0 0 1rem'>
<summary>Role definition example</summary>

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: pod-reader
rules:
  - apiGroups:
      - ""  # "" = core API group
    resources:
      - pods
    verbs:
      - get
      - list
      - watch
```

</details>

<details style='padding: 0 0 1rem 1rem'>
<summary>ClusterRole definition example</summary>

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # no `namespace`, as ClusterRoles are non-namespaced
  name: secret-reader
rules:
  - apiGroups:
      - ""  # "" = core API group
    resources:
      - secrets
    verbs:
      - get
      - list
      - watch
```

</details>

Roles are usually used to grant access to workloads in Pods.<br/>
ClusterRoles are usually used to grant access to cluster-scoped resources (Nodes), non-resource endpoints
(`/healthz`), and namespaced resources across all namespaces.

_RoleBinding_s grant the permissions defined in Roles or ClusterRoles to the _Subjects_ (Users, Groups, or Service
Accounts) they reference, only within the namespace they are defined in.<br/>
_ClusterRoleBinding_s do the same, but cluster-wide.

Bindings require the roles and the Subjects they refer to already exist.

<details style='padding: 0 0 0 1rem'>
<summary>RoleBinding definition example</summary>

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-pods
  namespace: default
subjects:
  - kind: User
    name: jane  # case sensitive
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: read-secrets
  namespace: development
subjects:
  - kind: User
    name: bob  # case sensitive
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io
```

</details>

<details style='padding: 0 0 1rem 1rem'>
<summary>ClusterRoleBinding definition example</summary>

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: read-secrets-global
subjects:
  - kind: Group
    name: manager  # case sensitive
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: secret-reader
  apiGroup: rbac.authorization.k8s.io
```

</details>

Roles, ClusterRoles, RoleBindings and ClusterRoleBindings must be given valid [path segment names].

Bindings are **immutable**. After creating a binding, one **cannot** change the Role or ClusterRole it refers
to.<br/>
Trying to change a binding's `roleRef` causes a validation error. To change it, one needs to remove the binding and
replace it whole.

Use the `kubectl auth reconcile` utility to create or update RBAC objects from a manifest file.<br/>
It also handles deleting and recreating binding objects, if required, to change the role they refer to.
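
A minimal usage sketch (the file name is a hypothetical example):

```sh
# Create or update the RBAC objects defined in the manifest.
kubectl auth reconcile -f 'my-rbac-rules.yaml'

# Preview the changes without persisting them.
kubectl auth reconcile -f 'my-rbac-rules.yaml' --dry-run=client
```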

Wildcards can be used in resource and verb entries, but this is not advised as it could result in overly permissive
access being granted to sensitive resources.

ClusterRoles can be **aggregated** into a single combined ClusterRole.

<details style='padding: 0 0 0 1rem'>

A controller watches for ClusterRole objects with `aggregationRule`s.

`aggregationRule`s define at least one label selector.<br/>
That selector will be used by the controller to match and combine other ClusterRoles into the rules field of the
source one.

New ClusterRoles matching the label selector of an existing aggregated ClusterRole will trigger adding the new rules
into the aggregated ClusterRole.

</details>

<details style='padding: 0 0 1rem 1rem'>
<summary>Aggregated ClusterRole definition example</summary>

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring-endpoints
  labels:
    rbac.example.com/aggregate-to-monitoring: "true"
rules:
  - apiGroups: [""]
    resources: ["services", "endpointslices", "pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: monitoring
aggregationRule:
  clusterRoleSelectors:
    - matchLabels:
        rbac.example.com/aggregate-to-monitoring: "true"
rules: []  # The control plane automatically fills in the rules
```

</details>

## Autoscaling

Controllers are available to scale Pods or Nodes automatically, both in number and in size.

Automatic scaling of Pods is done in number by the Horizontal Pod Autoscaler, and in size by the
Vertical Pod Autoscaler.<br/>
Automatic scaling of Nodes is done in number by the Cluster Autoscaler, and in size by add-ons like [Karpenter].

> Beware of mixing and matching autoscalers for the same kind of resource.<br/>
> One can easily defy the work done by the other and make that resource behave unexpectedly.

K8S only comes with the Horizontal Pod Autoscaler by default.<br/>
Managed K8S usually also comes with the [Cluster Autoscaler], if autoscaling is enabled on the cluster resource.

The Horizontal and Vertical Pod Autoscalers require access to metrics.<br/>
This requires the [metrics server] addon to be installed and accessible.

### Pod scaling

Autoscaling of Pods by number requires the use of the [Horizontal Pod Autoscaler].<br/>
Autoscaling of Pods by size requires the use of the [Vertical Pod Autoscaler].

> Avoid running both the HPA **and** the VPA on the same workload.<br/>
> The two will easily collide and try to one-up each other, leading to the workload's Pods changing resources **and**
> number of replicas as frequently as they can.

Both the HPA and the VPA can currently monitor only CPU and memory.<br/>
Use add-ons like [KEDA] to scale workloads based on different metrics.

#### Horizontal Pod Autoscaler

Refer [Horizontal Pod Autoscaling] and [HorizontalPodAutoscaler Walkthrough].<br/>
See also [HPA not scaling down].

The HPA decides on the amount of replicas on the premise of their **current** amount.<br/>
The algorithm's formula is `desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]`.<br/>
E.g., with 4 current replicas averaging 200m of CPU against a 100m target, the HPA scales to
`ceil[ 4 * ( 200 / 100 ) ] = 8` replicas.

Downscaling has a default cooldown period.
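
A minimal HPA sketch targeting a hypothetical `my-app` Deployment on average CPU utilization:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:           # the workload to scale
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70  # the desiredMetricValue in the formula above
```
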
#### Vertical Pod Autoscaler

TODO

### Node scaling

Autoscaling of Nodes by number requires the [Cluster Autoscaler].

1. The Cluster Autoscaler routinely checks for pending Pods.
1. Pods fill up the available Nodes.
1. When Pods start to fail for lack of available resources, Nodes are added to the cluster.
1. When Pods are not failing due to lack of available resources and one or more Nodes are underused, the Autoscaler
   tries to fit the existing Pods in fewer Nodes.
1. If the previous step would leave one or more Nodes unused (DaemonSets are usually not taken into consideration),
   the Autoscaler will terminate them.

Autoscaling of Nodes by size requires add-ons like [Karpenter].

## Scheduling

When Pods are created, they go to a queue and wait to be scheduled.

The scheduler picks a Pod from the queue and tries to schedule it on a Node.<br/>
If no Node satisfies **all** the requirements of the Pod, preemption logic is triggered for that Pod.

Preemption logic tries to find a Node where the removal of one or more other _lower priority_ Pods would allow the
pending one to be scheduled on that Node.<br/>
If such a Node is found, one or more other lower priority Pods are evicted from that Node to make space for the
pending Pod. After the evicted Pods are gone, the pending Pod can be scheduled on that Node.

### Dedicate Nodes to specific workloads

Leverage [taints][Taints and Tolerations] and [Node Affinity][Affinity and anti-affinity].

Refer [Assigning Pods to Nodes].

1. Taint the dedicated Nodes:

   ```sh
   $ kubectl taint nodes 'host1' 'dedicated=devs:NoSchedule'
   node "host1" tainted
   ```

1. Add Labels to the same Nodes:

   ```sh
   $ kubectl label nodes 'host1' 'dedicated=devs'
   node "host1" labeled
   ```

1. Add matching tolerations and Node Affinity preferences to the dedicated workloads' Pod's `spec`:

   ```yaml
   spec:
     affinity:
       nodeAffinity:
         requiredDuringSchedulingIgnoredDuringExecution:
           nodeSelectorTerms:
             - matchExpressions:
                 - key: dedicated
                   operator: In
                   values:
                     - devs
     tolerations:
       - key: "dedicated"
         operator: "Equal"
         value: "devs"
         effect: "NoSchedule"
   ```

### Spread Pods on Nodes

Leverage [Pod Topology Spread Constraints] and/or [Pod anti-affinity][Affinity and anti-affinity].

See also [Avoiding Kubernetes Pod Topology Spread Constraint Pitfalls].

<details>
<summary>Basic examples</summary>

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    example.org/app: someService
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - podAffinityTerm:
            labelSelector:
              matchLabels:
                example.org/app: someService
            topologyKey: kubernetes.io/hostname
          weight: 100
  topologySpreadConstraints:
    - labelSelector:
        matchLabels:
          example.org/app: someService
      maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
    - labelSelector:
        matchLabels:
          example.org/app: someService
      maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: ScheduleAnyway
```

</details>

## Quality of service

See [Configure Quality of Service for Pods] for more information.

QoS classes are used to make decisions about scheduling and evicting Pods.<br/>
When a Pod is created, it is also assigned one of the following QoS classes:

- _Guaranteed_, when **every** Container in the Pod, including init containers, has:

  - a memory limit **and** a memory request, **and** they are the same
  - a CPU limit **and** a CPU request, **and** they are the same

  ```yaml
  spec:
    containers:
      …
      resources:
        limits:
          cpu: 700m
          memory: 200Mi
        requests:
          cpu: 700m
          memory: 200Mi
  …
  status:
    qosClass: Guaranteed
  ```

- _Burstable_, when

  - the Pod does not meet the criteria for the _Guaranteed_ QoS class
  - **at least one** Container in the Pod has a memory **or** CPU request spec

  ```yaml
  spec:
    containers:
      - name: qos-demo
        …
        resources:
          limits:
            memory: 200Mi
          requests:
            memory: 100Mi
  …
  status:
    qosClass: Burstable
  ```

- _BestEffort_, when the Pod does not meet the criteria for the other QoS classes (its Containers have **no** memory
  or CPU limits **nor** requests)

  ```yaml
  spec:
    containers:
      …
      resources: {}
  …
  status:
    qosClass: BestEffort
  ```

## Containers with high privileges

Kubernetes [introduced a Security Context][security context design proposal] as a mitigation solution to some
workloads requiring changes to one or more Node settings for performance, stability, or other issues (e.g.
[ElasticSearch]).<br/>
This is usually achieved by executing the needed command from an InitContainer with higher privileges than normal,
which will have access to the Node's resources and break the isolation Containers are usually famous for. If
compromised, an attacker can use this highly privileged container to gain access to the underlying Node.

From the design proposal:

> A security context is a set of constraints that are applied to a Container in order to achieve the following goals
> (from the [Security design][Security Design Proposal]):
>
> - ensure a **clear isolation** between the Container and the underlying host it runs on;
> - **limit** the ability of the Container to negatively impact the infrastructure or other Containers.
>
> \[The main idea is that] **Containers should only be granted the access they need to perform their work**. The
> Security Context takes advantage of containerization features such as the ability to
> [add or remove capabilities][Runtime privilege and Linux capabilities in Docker containers] to give a process some
> privileges, but not all the privileges of the `root` user.

### Capabilities

Adding capabilities to a Container is **not** making it _privileged_, **nor** allowing _privilege escalation_. It is
just giving the Container the ability to write to specific files or devices, depending on the given capability.

This means having a capability assigned does **not** automatically make the Container able to wreak havoc on a Node,
and this practice **can be a legitimate use** of this feature instead.

From the feature's `man` page:

> Linux divides the privileges traditionally associated with superuser into distinct units, known as _capabilities_,
> which can be independently enabled and disabled. Capabilities are a per-thread attribute.

This also means a Container will be **limited** to its contents, plus the capabilities it has been assigned.

Some capabilities are assigned to all Containers by default, while others (the ones which could cause more issues)
need to be **explicitly** set using the Containers' `securityContext.capabilities.add` property.<br/>
If a Container is _privileged_ (see [Privileged container vs privilege escalation]), it will have access to **all**
the capabilities, regardless of which ones are explicitly assigned to it.
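
A hedged sketch of the property in use (container name, image, and the chosen capabilities are illustrative
assumptions):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: net-tweaker  # hypothetical workload
      image: busybox
      securityContext:
        capabilities:
          drop: [ALL]        # start from nothing…
          add: [NET_ADMIN]   # …then grant only what the work needs
```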

Check:

- [Linux capabilities], to see what capabilities can be assigned to a process **in a Linux system**;
- [Runtime privilege and Linux capabilities in Docker containers] for the capabilities available **inside
  Kubernetes**, and
- [Container capabilities in Kubernetes] for a handy table associating capabilities in Kubernetes to their Linux
  variant.

### Privileged container vs privilege escalation

A _privileged container_ is very different from a _container leveraging privilege escalation_.

A **privileged container** does whatever a process running directly on the Node can.<br/>
It will have automatically assigned **all** [capabilities](#capabilities), and being `root` in this container is
effectively being `root` on the Node it is running on.

> For a Container to be _privileged_, its definition **requires the `securityContext.privileged` property set to
> `true`**.

**Privilege escalation** allows **a process inside the Container** to gain more privileges than its parent
process.<br/>
The process will be able to assume `root`-like powers, but will have access only to the **assigned**
[capabilities](#capabilities) and generally have limited to no access to the Node, like any other Container.

> For a Container to _leverage privilege escalation_, its definition **requires the
> `securityContext.allowPrivilegeEscalation` property**:
>
> - to **either** be set to `true`, or
> - to **not be set** at all **if**:
>   - the Container is already privileged, or
>   - the Container has `SYS_ADMIN` capabilities.
>
> This property directly controls whether the [`no_new_privs`][No New Privileges Design Proposal] flag gets set on
> the Container's process.

From the [design document for `no_new_privs`][No New Privileges Design Proposal]:

> In Linux, the `execve` system call can grant more privileges to a newly-created process than its parent process.
> Considering security issues, since Linux kernel v3.5, there is a new flag named `no_new_privs` added to prevent
> those new privileges from being granted to the processes.
>
> `no_new_privs` is inherited across `fork`, `clone` and `execve` and **can not be unset**. With `no_new_privs` set,
> `execve` promises not to grant the privilege to do anything that could not have been done without the `execve`
> call.
>
> For more details about `no_new_privs`, please check the
> [Linux kernel documentation][no_new_privs linux kernel documentation].
>
> \[…]
>
> To recap, below is a table defining the default behavior at the pod security policy level and what can be set as a
> default with a pod security policy:
>
> | allowPrivilegeEscalation setting | uid = 0 or unset   | uid != 0           | privileged/CAP_SYS_ADMIN |
> | -------------------------------- | ------------------ | ------------------ | ------------------------ |
> | nil                              | no_new_privs=true  | no_new_privs=false | no_new_privs=false       |
> | false                            | no_new_privs=true  | no_new_privs=true  | no_new_privs=false       |
> | true                             | no_new_privs=false | no_new_privs=false | no_new_privs=false       |
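
A side-by-side sketch of the two properties (container names and images are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: privileged-one  # root here is effectively root on the Node
      image: busybox
      securityContext:
        privileged: true
    - name: locked-down     # processes cannot gain more privileges than their parent
      image: busybox
      securityContext:
        allowPrivilegeEscalation: false
```
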
## Sysctl settings

See [Using `sysctls` in a Kubernetes Cluster][using sysctls in a kubernetes cluster].
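
A minimal sketch of setting a _safe_ sysctl through a Pod's security context (the parameter and value are
illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
spec:
  securityContext:
    sysctls:
      - name: net.ipv4.ip_local_port_range  # namespaced per Pod, considered safe
        value: "1024 65535"
```
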
## Backup and restore

See [velero].

## Managed Kubernetes Services

Most cloud providers offer their managed versions of Kubernetes. Check their websites:

- [Azure Kubernetes Service]

### Best practices in cloud environments

All Kubernetes clusters should:

- be created using **IaC** ([terraform], [pulumi]);
- have different Node Pools dedicated to different workloads;
- have at least one Node Pool composed of **non-preemptible** Nodes, dedicated to critical services like Admission
  Controller Webhooks.

Each Node Pool should:

- have a _meaningful_ **name** (like `<prefix…>-<workload_type>-<random_id>`) to make it easy to recognize the
  workloads running on it or the features of the Nodes in it;
- have a _minimum_ set of _meaningful_ **labels**, like:
  - cloud provider information;
  - Node information and capabilities;
- spread Nodes across multiple **availability zones**.

## Edge computing

If planning to run Kubernetes on a Raspberry Pi, see [k3s] and the
[Build your very own self-hosting platform with Raspberry Pi and Kubernetes] series of articles.

## Troubleshooting

### Golang applications have trouble performing as expected

Also see [Container CPU Requests & Limits Explained with GOMAXPROCS Tuning].

By default, Golang sets `GOMAXPROCS` (the number of OS threads for Go code execution) **to the number of available
CPUs on the Node running the Pod**.<br/>
This is **different** from the amount of resources the Pod is allocated when a CPU limit is set in the Pod's
specification, and the Go scheduler might try to run more or fewer threads than the application has CPU time for.

Properly set the `GOMAXPROCS` environment variable in the Pod's specification to match the limits imposed on the
Pod.<br/>
If the CPU limit is less than `1000m` (1 CPU core), set `GOMAXPROCS=1`.

An easy way to do this is to reference the environment variable's value from other fields.<br/>
Refer [Expose Pod Information to Containers Through Environment Variables].

<details style='padding-left: 1rem'>

```yml
apiVersion: v1
kind: Pod
spec:
  containers:
    - env:
        - name: GOMAXPROCS
          valueFrom:
            resourceFieldRef:
              resource: limits.cpu
              divisor: "1"  # quantity resource core, canonicalizes value to X digits - '1': 2560m -> 3
      resources:
        limits:
          cpu: 2560m
```

</details>

### Recreate Pods upon ConfigMap's or Secret's content change

Use a checksum annotation to do the trick (in a Helm template):

```yaml
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    metadata:
      annotations:
        checksum/configmap: {{ include (print $.Template.BasePath "/configmap.yaml") $ | sha256sum }}
        checksum/secret: {{ include (print $.Template.BasePath "/secret.yaml") $ | sha256sum }}
        {{- if .podAnnotations }}
        {{- toYaml .podAnnotations | trim | nindent 8 }}
        {{- end }}
```

### Run a command in a Pod right after its initialization

Use a container's `lifecycle.postStart.exec.command` spec:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  template:
    …
    spec:
      containers:
        - name: my-container
          …
          lifecycle:
            postStart:
              exec:
                command: ["/bin/sh", "-c", "echo 'heeeeeeey yaaaaaa!'"]
```

### Run a command just before a Pod stops

Leverage the `preStop` hook instead of `postStart`.

> Hooks **are not passed parameters**, and this includes environment variables.<br/>
> Use a script if you need them. See [container hooks] and [preStop hook doesn't work with env variables].
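
A minimal sketch of the hook (container name and command are illustrative assumptions):

```yaml
apiVersion: v1
kind: Pod
spec:
  containers:
    - name: my-container
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "echo 'shutting down'"]
```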

Since Kubernetes 1.9, the volumeMounts behavior on secret, configMap, downwardAPI and projected volumes has changed
to read-only by default.<br/>
A workaround to the problem is to create an `emptyDir` Volume, copy the contents into it, and execute/write whatever
you need:

```yaml
initContainers:
  - name: copy-ro-scripts
    image: busybox
    command: ['sh', '-c', 'cp /scripts/* /etc/pre-install/']
    volumeMounts:
      - name: scripts
        mountPath: /scripts
      - name: pre-install
        mountPath: /etc/pre-install
volumes:
  - name: pre-install
    emptyDir: {}
  - name: scripts
    configMap:
      name: bla
```

## Examples

### Create an admission webhook

See the example's [README][create an admission webhook].

## Further readings

Usage:

- [Official documentation][documentation]
- [Configure a Pod to use a ConfigMap]
- [Distribute credentials securely using Secrets]
- [Configure a Security Context for a Pod or a Container]
- [Set capabilities for a Container]
- [Using `sysctl`s in a Kubernetes Cluster][using sysctls in a kubernetes cluster]

Concepts:

- [Namespaces]
- [Container hooks]
- Kubernetes' [security context design proposal]
- Kubernetes' [No New Privileges Design Proposal]
- [Linux kernel documentation about `no_new_privs`][no_new_privs linux kernel documentation]
- [Linux capabilities]
- [Runtime privilege and Linux capabilities in Docker containers]
- [Container capabilities in Kubernetes]
- [Kubernetes SecurityContext Capabilities Explained]
- [Best practices for Pod security in Azure Kubernetes Service (AKS)]
- [Network policies]

Distributions:

- [K3S]
- [RKE2]
- [K0S]

Tools:

- [`kubectl`][kubectl]
- [`helm`][helm]
- [`helmfile`][helmfile]
- [`kustomize`][kustomize]
- [`kubeval`][kubeval]
- `kube-score`
- [`kubectx`+`kubens`][kubectx+kubens], alternative to [`kubie`][kubie] and [`kubeswitch`][kubeswitch]
- [`kubeswitch`][kubeswitch], alternative to [`kubie`][kubie] and [`kubectx`+`kubens`][kubectx+kubens]
- [`kube-ps1`][kube-ps1]
- [`kubie`][kubie], alternative to [`kubeswitch`][kubeswitch], and to [`kubectx`+`kubens`][kubectx+kubens] and
  [`kube-ps1`][kube-ps1]
- [Minikube]
- [Kubescape]

Add-ons of interest:

- [Certmanager][cert-manager]
- [ExternalDNS][external-dns]
- [Flux]
- [Istio]
- [KEDA]
- [k8s-ephemeral-storage-metrics]

Others:

- The [Build your very own self-hosting platform with Raspberry Pi and Kubernetes] series of articles
- [Why separate your Kubernetes workload with nodepool segregation and affinity options]
- [RBAC.dev]
- [Scaling Kubernetes to 7,500 nodes]

### Sources

- Kubernetes' [concepts]
- [How to run a command in a Pod after initialization]
- [Making sense of Taints and Tolerations]
- [Read-only filesystem error]
- [preStop hook doesn't work with env variables]
- [Configure Quality of Service for Pods]
- [Version skew policy]
- [Labels and Selectors]
- [Recommended Labels]
- [Configure Liveness, Readiness and Startup Probes]
- [Configuration best practices]
- [Cloudzero Kubernetes best practices]
- [Scaling K8S nodes without breaking the bank or your sanity - Brandon Wagner & Nick Tran, Amazon]
- [Kubernetes Troubleshooting - The Complete Guide]
- [Kubernetes cluster autoscaler]
- [Common labels]
- [What is Kubernetes?]
- [Using RBAC Authorization]
- [Expose Pod information to Containers through files]
- [Avoiding Kubernetes Pod Topology Spread Constraint Pitfalls]
- [Kubernetes Complete Hands‑On Guides]

<!-- In-article sections -->
[horizontal pod autoscaler]: #horizontal-pod-autoscaler
[vertical pod autoscaler]: #vertical-pod-autoscaler
[pods]: #pods
[privileged container vs privilege escalation]: #privileged-container-vs-privilege-escalation

<!-- Knowledge base -->
[azure kubernetes service]: ../cloud%20computing/azure/aks.md
[cert-manager]: cert-manager.md
[cluster autoscaler]: cluster%20autoscaler.md
[connection tracking]: ../connection%20tracking.placeholder
[create an admission webhook]: ../../examples/kubernetes/create%20an%20admission%20webhook/README.md
[etcd]: ../etcd.md
[external-dns]: external-dns.md
[flux]: flux.md
[gitops]: ../gitops.md
[helm]: helm.md
[helmfile]: helmfile.md
[istio]: istio.md
[k0s]: k0s.placeholder
[k3s]: k3s.md
[karpenter]: karpenter.md
[keda]: keda.md
[kubectl]: kubectl.md
[kubescape]: kubescape.md
[kubeval]: kubeval.md
[kustomize]: kustomize.md
[metrics server]: metrics%20server.md
[minikube]: minikube.md
[network policies]: network%20policies.md
[pulumi]: ../pulumi.md
[rke2]: rke2.md
[terraform]: ../terraform.md
[velero]: velero.md

<!-- Upstream -->
[addons]: https://kubernetes.io/docs/concepts/cluster-administration/addons/
[Affinity and anti-affinity]: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
[api deprecation policy]: https://kubernetes.io/docs/reference/using-api/deprecation-policy/
[Assigning Pods to Nodes]: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/
[common labels]: https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
[concepts]: https://kubernetes.io/docs/concepts/
[configuration best practices]: https://kubernetes.io/docs/concepts/configuration/overview/
[configure a pod to use a configmap]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/
[configure a security context for a pod or a container]: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/
[configure liveness, readiness and startup probes]: https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/
[configure quality of service for pods]: https://kubernetes.io/docs/tasks/configure-pod-container/quality-service-pod/
[container hooks]: https://kubernetes.io/docs/concepts/containers/container-lifecycle-hooks/#container-hooks
[distribute credentials securely using secrets]: https://kubernetes.io/docs/tasks/inject-data-application/distribute-credentials-secure/
[documentation]: https://kubernetes.io/docs/home/
[Expose Pod Information to Containers Through Environment Variables]: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
[expose pod information to containers through files]: https://kubernetes.io/docs/tasks/inject-data-application/downward-api-volume-expose-pod-information/
[Horizontal Pod Autoscaling]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/
[HorizontalPodAutoscaler Walkthrough]: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/
[labels and selectors]: https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/
[namespaces]: https://kubernetes.io/docs/concepts/overview/working-with-objects/namespaces/
[no new privileges design proposal]: https://github.com/kubernetes/design-proposals-archive/blob/main/auth/no-new-privs.md
[path segment names]: https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#path-segment-names
[Pod Topology Spread Constraints]: https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/
[production best practices checklist]: https://learnk8s.io/production-best-practices
[recommended labels]: https://kubernetes.io/docs/concepts/overview/working-with-objects/common-labels/
[resource management for pods and containers]: https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
[security context design proposal]: https://github.com/kubernetes/design-proposals-archive/blob/main/auth/security_context.md
[security design proposal]: https://github.com/kubernetes/design-proposals-archive/blob/main/auth/security.md
[set capabilities for a container]: https://kubernetes.io/docs/tasks/configure-pod-container/security-context/#set-capabilities-for-a-container
[Taints and Tolerations]: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
[using rbac authorization]: https://kubernetes.io/docs/reference/access-authn-authz/rbac/
[using sysctls in a kubernetes cluster]: https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/
[version skew policy]: https://kubernetes.io/releases/version-skew-policy/
[volumes]: https://kubernetes.io/docs/concepts/storage/volumes/

<!-- Others -->
[Avoiding Kubernetes Pod Topology Spread Constraint Pitfalls]: https://medium.com/wise-engineering/avoiding-kubernetes-pod-topology-spread-constraint-pitfalls-d369bb04689e
[best practices for pod security in azure kubernetes service (aks)]: https://learn.microsoft.com/en-us/azure/aks/developer-best-practices-pod-security
[build your very own self-hosting platform with raspberry pi and kubernetes]: https://kauri.io/build-your-very-own-self-hosting-platform-with-raspberry-pi-and-kubernetes/5e1c3fdc1add0d0001dff534/c
[cloudzero kubernetes best practices]: https://www.cloudzero.com/blog/kubernetes-best-practices
[cncf]: https://www.cncf.io/
[container capabilities in kubernetes]: https://unofficial-kubernetes.readthedocs.io/en/latest/concepts/policy/container-capabilities/
[Container CPU Requests & Limits Explained with GOMAXPROCS Tuning]: https://victoriametrics.com/blog/kubernetes-cpu-go-gomaxprocs/
[elasticsearch]: https://github.com/elastic/helm-charts/issues/689
[how to run a command in a pod after initialization]: https://stackoverflow.com/questions/44140593/how-to-run-command-after-initialization/44146351#44146351
[HPA not scaling down]: https://stackoverflow.com/questions/65704583/hpa-not-scaling-down#65770916
[k8s-ephemeral-storage-metrics]: https://github.com/jmcgrath207/k8s-ephemeral-storage-metrics
[kube-ps1]: https://github.com/jonmosco/kube-ps1
[kubectx+kubens]: https://github.com/ahmetb/kubectx
[kubernetes cluster autoscaler]: https://www.kubecost.com/kubernetes-autoscaling/kubernetes-cluster-autoscaler/
[Kubernetes Complete Hands‑On Guides]: https://github.com/anveshmuppeda/kubernetes
[kubernetes securitycontext capabilities explained]: https://www.golinuxcloud.com/kubernetes-securitycontext-capabilities/
[kubernetes troubleshooting - the complete guide]: https://komodor.com/learn/kubernetes-troubleshooting-the-complete-guide/
[kubeswitch]: https://github.com/danielfoehrKn/kubeswitch
[kubie]: https://github.com/sbstp/kubie
[linux capabilities]: https://man7.org/linux/man-pages/man7/capabilities.7.html
[making sense of taints and tolerations]: https://medium.com/kubernetes-tutorials/making-sense-of-taints-and-tolerations-in-kubernetes-446e75010f4e
[no_new_privs linux kernel documentation]: https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt
[prestop hook doesn't work with env variables]: https://stackoverflow.com/questions/61929055/kubernetes-prestop-hook-doesnt-work-with-env-variables#62135231
[rbac.dev]: https://rbac.dev/
[read-only filesystem error]: https://stackoverflow.com/questions/49614034/kubernetes-deployment-read-only-filesystem-error/51478536#51478536
[runtime privilege and linux capabilities in docker containers]: https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities
[scaling k8s nodes without breaking the bank or your sanity - brandon wagner & nick tran, amazon]: https://www.youtube.com/watch?v=UBb8wbfSc34
[scaling kubernetes to 7,500 nodes]: https://openai.com/index/scaling-kubernetes-to-7500-nodes/
[what is kubernetes?]: https://www.youtube.com/watch?v=a2gfpZE8vXY
[why separate your kubernetes workload with nodepool segregation and affinity options]: https://medium.com/contino-engineering/why-separate-your-kubernetes-workload-with-nodepool-segregation-and-affinity-rules-cb5225953788