SEANK.H.LIAO

opinionated k8s deployment

an opinionated and commented deployment manifest for generic apps

kubernetes manifests

YAML engineer reporting in.

note: YAML is long and repetitive, and I'm still not sure if I'm happy I introduced YAML anchors to my team. tl;dr: the two documents below are equivalent; anchors do not carry across document boundaries (---):

name: &name foo
somewhere:
  else:
    x: *name
---
name: foo
somewhere:
  else:
    x: foo
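
Within a single document they do work, and the main payoff in these manifests is deduplicating the label blocks. A minimal sketch, as a fragment of a Deployment (using the same labels as in the metadata section below):

metadata:
  labels: &labels
    app.kubernetes.io/name: foo
    app.kubernetes.io/instance: default
spec:
  selector:
    matchLabels: *labels # expands to the same two labels
  template:
    metadata:
      labels: *labels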

metadata

Every object has them: names, labels, annotations. There's even a recommended set of labels.

metadata:
  name: foo
  annotations:
    # stick values that you don't want to filter by here,
    # such as info for other apps that read service definitions
    # or as a place to store data to make your controller stateless
  labels:
    # sort of duplicates metadata.name
    app.kubernetes.io/name: foo

    # separates multiple instances, not really necessary if you do app-per-namespace
    app.kubernetes.io/instance: default

    # you might not want to add this on everything (eg namespaces, security stuff)
    # since with least privilege you can't change them
    # and they don't really change that often(?)
    app.kubernetes.io/version: "1.2.3"

    # the hardest part is probably getting it to not say "helm" when you don't actually use helm
    app.kubernetes.io/managed-by: helm

    # these two aren't really necessary for single-deployment apps
    #
    # the general purpose of "name", eg name=envoy component=proxy
    app.kubernetes.io/component: server
    # what the entire thing is
    app.kubernetes.io/part-of: website

namespace

The hardest part about namespaces is your namespace allocation policy: do you hand them out per team, per app, per environment, or some combination?

Hierarchical Namespaces might help a bit, making the finer-grained options more tenable, but there are still things to think about.

Currently I'm in the "each app gets its own namespace" camp, and live with the doubled names in service addresses (eg foo.foo.svc.cluster.local).

apiVersion: v1
kind: Namespace
metadata:
  name: foo

ingress

The least common denominator of L4/L7 routing...

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: foo
spec:
  # for if you run multiple ingress controllers
  ingressClassName: default

  rules:
      # DNS style wildcards only
    - host: "*.example.com"
      http:
        paths:
          - path: /
            pathType: Prefix # or Exact, prefix uses path segment matching
            backend:
              service:
                name: foo
                port:
                  name: http
                  # number: 80

  tls:
    - secretName: foo-tls
      hosts:
        - "*.example.com"

service

apiVersion: v1
kind: Service
metadata:
  name: foo
spec:
  # change as needed
  type: ClusterIP

  # only for type LoadBalancer
  externalTrafficPolicy: Local

  # for statefulsets that need peer discovery,
  # eg. etcd or cockroachdb
  publishNotReadyAddresses: true

  ports:
    - appProtocol: opentelemetry
      name: otlp
      port: 4317
      protocol: TCP
      targetPort: otlp # name or number, defaults to port

  selector:
    # these 2 should be enough to uniquely identify apps,
    # note this value cannot change once created
    app.kubernetes.io/name: foo
    app.kubernetes.io/instance: default

serviceaccount

note: while it does have a secrets field, it currently doesn't really do anything useful.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: foo
  annotations:
    # workload identity for attaching to GCP service accounts in GKE
    iam.gke.io/gcp-service-account: GSA_NAME@PROJECT_ID.iam.gserviceaccount.com

app

deployment

Use only if your app is truly stateless: no PersistentVolumeClaims unless they're ReadOnlyMany, and even then PVCs restrict which nodes you can run on.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
spec:
  # don't set if you plan on autoscaling
  replicas: 1

  # stop cluttering kubectl get all with old replicasets,
  # your gitops tooling should let you roll back
  revisionHistoryLimit: 3

  selector:
    matchLabels:
      # these 2 should be enough to uniquely identify apps,
      # note this value cannot change once created
      app.kubernetes.io/name: foo
      app.kubernetes.io/instance: default

  # annoyingly named differently from StatefulSet or DaemonSet
  strategy:
    # prefer maxSurge to keep availability during upgrades / migrations
    rollingUpdate:
      maxSurge: 25% # rounds up
      maxUnavailable: 0

    # Recreate if you want blue-green style
    # or if you're stuck with a PVC
    type: RollingUpdate

  template: # see pod below

statefulset

If your app has any use for persistent data, use this, even if you only have a single instance. It also gives you nice DNS names per pod (see the headless Service sketch after the manifest).

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: foo
spec:
  # or Parallel for all at once
  podManagementPolicy: OrderedReady
  replicas: 3

  # don't keep too many old controllerrevisions around,
  # your gitops tooling should let you roll back
  revisionHistoryLimit: 3

  selector:
    matchLabels:
      # these 2 should be enough to uniquely identify apps,
      # note this value cannot change once created
      app.kubernetes.io/name: foo
      app.kubernetes.io/instance: default

  # even though they say it must exist, it doesn't have to
  # (but you lose per pod DNS)
  serviceName: foo

  template: # see pod below

  updateStrategy:
    rollingUpdate: # this should only be used by tooling
    type: RollingUpdate

  volumeClaimTemplates: # see pvc below
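
The per-pod DNS names only resolve when serviceName points at a headless Service. A minimal sketch of one, assuming the app-per-namespace layout above (namespace foo, port copied from the service section):

apiVersion: v1
kind: Service
metadata:
  name: foo
spec:
  clusterIP: None # headless: DNS returns pod IPs directly
  publishNotReadyAddresses: true
  ports:
    - name: otlp
      port: 4317
  selector:
    app.kubernetes.io/name: foo
    app.kubernetes.io/instance: default

# pods become resolvable as foo-0.foo.foo.svc.cluster.local,
# foo-1.foo.foo.svc.cluster.local, ...
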
daemonset

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: foo
spec:
  # don't keep too many old controllerrevisions around,
  # your gitops tooling should let you roll back
  revisionHistoryLimit: 3

  selector:
    matchLabels:
      # these 2 should be enough to uniquely identify apps,
      # note this value cannot change once created
      app.kubernetes.io/name: foo
      app.kubernetes.io/instance: default

  template: # see pod below

  updateStrategy:
    rollingUpdate:
      # make it faster for large clusters
      maxUnavailable: 30%
    type: RollingUpdate

pod

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: foo
      args:
        - -flag1=v1
        - -flag2=v2
      envFrom:
        - configMapRef:
            name: foo-env
            optional: true
          prefix: APP_
      image: docker.example.com/app:v1
      imagePullPolicy: IfNotPresent

      ports:
        - containerPort: 4317
          name: otlp
          protocol: TCP

      # do extra stuff
      lifecycle:
        postStart:
        preStop:

      startupProbe: # allow a longer startup
      livenessProbe: # stay alive to not get killed
      readinessProbe: # stay ready to receive traffic

      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          add:
            - NET_ADMIN # no CAP_ prefix
        privileged: false
        readOnlyRootFilesystem: true

      resources:
        # ideally set after running some time and profiling actual usage,
        # prefer to start high and ratchet down
        requests:
          cpu: 500m
          memory: 128Mi
        limits:
          cpu: 1500m
          memory: 512Mi

      volumeMounts: # as needed

  # don't inject env with service addresses/ports,
  # not many things use them, they clutter up the env
  # and may be a performance hit with a large number of services
  enableServiceLinks: false

  # do create PriorityClasses and give every pod one,
  # helps with deciding which pods to kill first
  priorityClassName: critical

  securityContext:
    fsGroup: 65535
    runAsGroup: 65535
    runAsNonRoot: true
    runAsUser: 65535 # may conflict with container setting and need for $HOME

  serviceAccountName: foo

  terminationGracePeriodSeconds: 30

  volumes: # set as needed
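
A class like critical isn't built in (only system-node-critical and system-cluster-critical exist by default); it has to be created as a cluster-scoped PriorityClass. A minimal sketch, with placeholder name and value:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical
value: 1000000 # higher value = evicted later
globalDefault: false
description: business critical services
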
scheduling

there is some overlap in managing pod scheduling, especially around where pods run:

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms: # OR
          # has to be pool-0
          - matchExpressions: # AND
              - key: cloud.google.com/gke-nodepool
                operator: In
                values:
                  - pool-0
      preferredDuringSchedulingIgnoredDuringExecution:
        # prefer zone us-central1-a
        - weight: 25
          preference:
            matchExpressions: # AND
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-central1-a

    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        # prefer to be on the same node as a bar
        - weight: 25
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: bar
                app.kubernetes.io/instance: default
            topologyKey: kubernetes.io/hostname

    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # AND
        # never schedule in the same region as buzz
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: buzz
              app.kubernetes.io/instance: default
          topologyKey: topology.kubernetes.io/region

  topologySpreadConstraints: # AND
    # limit to 1 instance per node
    - maxSkew: 1
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: foo
          app.kubernetes.io/instance: default
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule # or ScheduleAnyway

command
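
command overrides the image's ENTRYPOINT, args overrides CMD, and neither runs through a shell. A minimal sketch (the binary path and flags are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: foo
      command: # replaces the image ENTRYPOINT
        - /app
      args: # replaces the image CMD
        - -flag1=v1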

persistentvolumeclaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: foo
spec:
  accessModes:
    - ReadWriteOnce # ReadOnlyMany or ReadWriteMany (rare)

  dataSource: # prepopulate with data from a VolumeSnapshot or PersistentVolumeClaim

  resources:
    requests:
      storage: 10Gi

  # label selector to bind to an existing PV
  selector:
    matchLabels:
      app.kubernetes.io/name: foo

  storageClassName: ssd

  volumeMode: Filesystem # or Block

horizontalpodautoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: foo
spec:
  behavior: # fine tune when to scale up / down, see the sketch below

  maxReplicas: 5
  minReplicas: 1

  metrics:
    # eg scale on average CPU utilization across pods
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
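
A sketch of what behavior tuning can look like, here slowing down scale-down (the numbers are placeholders, this is a fragment of the spec above):

behavior:
  scaleDown:
    # require 5 minutes of consistently low usage before scaling down
    stabilizationWindowSeconds: 300
    policies:
      # and then remove at most 1 pod per minute
      - type: Pods
        value: 1
        periodSeconds: 60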

poddisruptionbudget

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: foo
spec:
  # set only one of maxUnavailable / minAvailable
  #
  # when you have a low number of replicas
  # ensure you can disrupt them
  maxUnavailable: 1

  # allows for more disruptions
  # minAvailable: 75%

  selector:
    matchLabels:
      app.kubernetes.io/name: foo
      app.kubernetes.io/instance: default