opinionated k8s deployment

an opinionated and commented deployment manifest for generic apps

SEAN K.H. LIAO

kubernetes manifests

YAML engineer reporting in.

note: yaml is long and repetitive, and I'm still not sure if I'm happy I introduced yaml anchors to my team. tl;dr: the 2 docs below are equivalent, since anchors do not carry across documents (---):

name: &name foo
somewhere:
  else:
    x: *name
---
name: foo
somewhere:
  else:
    x: foo

metadata

Every object has them: names, labels, annotations. There's even a recommended set of labels.

metadata:
  name: foo
  annotations:
    # stick values that you don't want to filter by here,
    # such as info for other apps that read service definitions
    # or as a place to store data to make your controller stateless
  labels:
    # sort of duplicates metadata.name
    app.kubernetes.io/name: foo

    # separate multiple instances, not really necessary if you do app-per-namespace
    app.kubernetes.io/instance: default

    # you might not want to add this on everything (eg namespaces, security stuff)
    # since with least privilege you can't change them
    # and they don't really change that often(?)
    app.kubernetes.io/version: "1.2.3"

    # the hardest part is probably getting it to not say "helm" when you don't actually use helm
    app.kubernetes.io/managed-by: helm

    # these two aren't really necessary for single deployment apps
    #
    # the general purpose of "name", eg name=envoy component=proxy
    app.kubernetes.io/component: server
    # what the entire thing is part of
    app.kubernetes.io/part-of: website

namespace

The hardest part about namespaces is your namespace allocation policy: do you use one big shared namespace, one namespace per team, one per app, or one per app per environment?

Hierarchical Namespaces might help a bit, making the finer-grained options more tenable, but it's still something to think about.

Currently I'm in the "each app gets its own namespace" camp, and live with the double names in service addresses.
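
For example, an app foo exposing a Service foo from its own foo namespace ends up with an address like this (assuming the default cluster.local cluster domain):

foo.foo.svc.cluster.local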

apiVersion: v1
kind: Namespace
metadata:
  name: foo

ingress

The least common denominator of L4/L7 routing...

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: foo
spec:
  # for if you run multiple ingress controllers
  ingressClassName: default

  rules:
      # DNS style wildcards only
    - host: "*.example.com"
      http:
        paths:
          - path: /
            pathType: Prefix # or Exact, prefix uses path segment matching
            backend:
              service:
                name: foo
                port:
                  name: http
                  # number: 80

  tls:
    - secretName: foo-tls
      hosts:
        - "*.example.com"

service

apiVersion: v1
kind: Service
metadata:
  name: foo
spec:
  # change as needed
  type: ClusterIP

  # only for type NodePort / LoadBalancer
  externalTrafficPolicy: Local

  # for statefulsets that need peer discovery,
  # eg. etcd or cockroachdb
  publishNotReadyAddresses: true

  ports:
    - appProtocol: opentelemetry
      name: otlp
      port: 4317
      protocol: TCP
      targetPort: otlp # name or number, defaults to port

  selector:
    # these 2 should be enough to uniquely identify apps
    # (unlike workload selectors, a Service selector can be changed later)
    app.kubernetes.io/name: foo
    app.kubernetes.io/instance: default

serviceaccount

note: while it does have a secrets field, it currently doesn't really do anything useful.

apiVersion: v1
kind: ServiceAccount
metadata:
  name: foo
  annotations:
    # workload identity for attaching to GCP service accounts in GKE
    iam.gke.io/gcp-service-account: GSA_NAME@PROJECT_ID.iam.gserviceaccount.com

app

deployment

Use only if your app is truly stateless: no PersistentVolumeClaims unless they're ReadOnlyMany, and even then PVCs still restrict the nodes you can run on.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: foo
spec:
  # don't set if you plan on autoscaling
  replicas: 1

  # stop cluttering kubectl get all with old replicasets,
  # your gitops tooling should let you roll back
  revisionHistoryLimit: 3

  selector:
    matchLabels:
      # these 2 should be enough to uniquely identify apps,
      # note this value cannot change once created
      app.kubernetes.io/name: foo
      app.kubernetes.io/instance: default

  # annoyingly named differently from StatefulSet or DaemonSet
  strategy:
    # prefer maxSurge to keep availability during upgrades / migrations
    rollingUpdate:
      maxSurge: 25% # rounds up
      maxUnavailable: 0

    # Recreate if you want blue-green style
    # or if you're stuck with a PVC
    type: RollingUpdate

  template: # see pod below

statefulset

If your app has any use for persistent data, use this, even if you only have a single instance. Also gives you nice DNS names per pod.
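
For example, with 3 replicas and a governing (headless) Service named foo in namespace foo, each pod gets a stable address (assuming the default cluster domain):

foo-0.foo.foo.svc.cluster.local
foo-1.foo.foo.svc.cluster.local
foo-2.foo.foo.svc.cluster.local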

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: foo
spec:
  # or Parallel for all at once
  podManagementPolicy: OrderedReady
  replicas: 3

  # stop keeping around so many old controllerrevisions,
  # your gitops tooling should let you roll back
  revisionHistoryLimit: 3

  selector:
    matchLabels:
      # these 2 should be enough to uniquely identify apps,
      # note this value cannot change once created
      app.kubernetes.io/name: foo
      app.kubernetes.io/instance: default

  # even though they say it must exist, it doesn't have to
  # (but you lose per pod DNS)
  serviceName: foo

  template: # see pod below

  updateStrategy:
    rollingUpdate: # the partition field in here should only be set by tooling
    type: RollingUpdate

  volumeClaimTemplates: # see pvc below

daemonset

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: foo
spec:
  # stop keeping around so many old controllerrevisions,
  # your gitops tooling should let you roll back
  revisionHistoryLimit: 3

  selector:
    matchLabels:
      # these 2 should be enough to uniquely identify apps,
      # note this value cannot change once created
      app.kubernetes.io/name: foo
      app.kubernetes.io/instance: default

  template: # see pod below

  updateStrategy:
    rollingUpdate:
      # make it faster for large clusters
      maxUnavailable: 30%
    type: RollingUpdate

pod

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  containers:
    - name: foo
      args:
        - -flag1=v1
        - -flag2=v2
      envFrom:
        - configMapRef:
            name: foo-env
            optional: true
          prefix: APP_
      image: docker.example.com/app:v1
      imagePullPolicy: IfNotPresent

      ports:
        - containerPort: 4317
          name: otlp
          protocol: TCP

      # do extra stuff
      lifecycle:
        postStart:
        preStop:

      startupProbe: # allow a longer startup before the other probes kick in
      livenessProbe: # failing this gets the container restarted
      readinessProbe: # failing this stops traffic being routed to the pod
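      # a minimal sketch of what one of these could look like,
      # assuming an HTTP health endpoint (path and port are assumptions):
      # livenessProbe:
      #   httpGet:
      #     path: /healthz
      #     port: http
      #   periodSeconds: 10
      #   failureThreshold: 3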

      securityContext:
        allowPrivilegeEscalation: false
        capabilities:
          add:
            - NET_ADMIN # no CAP_ prefix
        privileged: false
        readOnlyRootFilesystem: true

      resources:
        # ideally set after running some time and profiling actual usage,
        # prefer to start high and ratchet down
        requests:
          cpu: 500m
          memory: 128Mi
        limits:
          cpu: 1500m
          memory: 512Mi

      volumeMounts: # as needed

  # don't inject env with service addresses/ports
  # not many things use them, they clutter up the env
  # and may be a performance hit with a large number of services
  enableServiceLinks: false

  # do create PriorityClasses and give every pod one,
  # helps with deciding which pods to kill first
  priorityClassName: critical

  securityContext:
    fsGroup: 65535
    runAsGroup: 65535
    runAsNonRoot: true
    runAsUser: 65535 # may conflict with container setting and need for $HOME

  serviceAccountName: foo

  terminationGracePeriodSeconds: 30

  volumes: # set as needed

scheduling

there is some overlap in the ways of managing pod scheduling, especially around where pods run (see also the simpler nodeSelector / tolerations sketch after this manifest):

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms: # OR
          # has to be pool-0
          - matchExpressions: # AND
              - key: cloud.google.com/gke-nodepool
                operator: In
                values:
                  - pool-0
      preferredDuringSchedulingIgnoredDuringExecution:
        # prefer zone us-central1-a
        - weight: 25
          preference:
            matchExpressions: # AND
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                  - us-central1-a

    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        # prefer to be on the same node as a bar
        - weight: 25
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: bar
                app.kubernetes.io/instance: default
            topologyKey: kubernetes.io/hostname

    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution: # AND
        # never schedule in the same region as buzz
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: buzz
              app.kubernetes.io/instance: default
          topologyKey: topology.kubernetes.io/region


  topologySpreadConstraints: # AND
    # limit to 1 instance per node
    - maxSkew: 1
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: foo
          app.kubernetes.io/instance: default
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule # or ScheduleAnyway
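
For comparison, a sketch of the simpler mechanisms that overlap with the above: nodeSelector is a blunter version of required node affinity, and tolerations pair with node taints to allow pods onto dedicated nodes (the dedicated taint key/value here is an assumption):

apiVersion: v1
kind: Pod
metadata:
  name: foo
spec:
  # roughly equivalent to the required nodeAffinity above
  nodeSelector:
    cloud.google.com/gke-nodepool: pool-0

  # only needed if the target nodes are tainted
  tolerations:
    - key: dedicated
      operator: Equal
      value: foo
      effect: NoSchedule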

persistentvolumeclaim

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: foo
spec:
  accessModes:
    - ReadWriteOnce # or ReadOnlyMany / ReadWriteMany (rare)

  dataSource: # prepopulate with data from a VolumeSnapshot or PersistentVolumeClaim
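  # for example, restoring from a snapshot could look like this
  # (the snapshot name and CSI snapshot support are assumptions):
  # dataSource:
  #   apiGroup: snapshot.storage.k8s.io
  #   kind: VolumeSnapshot
  #   name: foo-snapshot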

  resources:
    requests:
      storage: 10Gi

  # bind to an existing PV by label
  selector:
    matchLabels: # as needed

  storageClassName: ssd

  volumeMode: Filesystem # or Block

horizontalpodautoscaler

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: foo
spec:
  behavior: # fine tune when to scale up / down

  maxReplicas: 5
  minReplicas: 1

  metrics:
    -  # TODO

  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: foo
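
The metrics above are left as a TODO; a common starting point is plain CPU utilization, roughly like this (the target value is an assumption):

  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75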

poddisruptionbudget

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: foo
spec:
  # note: only one of maxUnavailable / minAvailable may be set
  #
  # when you have a low number of replicas,
  # ensure you can still disrupt them
  maxUnavailable: 1

  # alternatively, a percentage allows for more disruptions as replicas scale up
  # minAvailable: 75%

  selector:
    matchLabels:
      app.kubernetes.io/name: foo
      app.kubernetes.io/instance: default