Gitlab runner

  1. TL;DR
  2. Pull images from private AWS ECR registries
  3. Runners on Kubernetes
  4. Autoscaling
    1. Docker Machine
  5. Further readings
    1. Sources

TL;DR

Installation
brew install 'gitlab-runner'
dnf install 'gitlab-runner'
docker pull 'gitlab/gitlab-runner'
helm --namespace 'gitlab' upgrade --install --create-namespace --version '0.64.1' --repo 'https://charts.gitlab.io' \
  'gitlab-runner' -f 'values.gitlab-runner.yml' 'gitlab-runner'
Usage
docker run --rm --name 'runner' 'gitlab/gitlab-runner:alpine-v13.6.0' --version

# `gitlab-runner exec` is deprecated and has been removed in 17.0. ┌П┐(ಠ_ಠ) Gitlab.
# See https://docs.gitlab.com/16.11/runner/commands/#gitlab-runner-exec-deprecated.
gitlab-runner exec docker 'job-name'
gitlab-runner exec docker \
  --env 'AWS_ACCESS_KEY_ID=AKIA…' --env 'AWS_SECRET_ACCESS_KEY=F…s' --env 'AWS_REGION=eu-west-1' \
  --env 'DOCKER_AUTH_CONFIG={ "credsStore": "ecr-login" }' \
  --docker-volumes "$HOME/.aws/credentials:/root/.aws/credentials:ro" \
  'job-requiring-ecr-access'

Each runner executor is assigned 1 task at a time.

Pull images from private AWS ECR registries

  1. Create an IAM Role in one's AWS account and attach the arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly IAM policy to it.

  2. Create an InstanceProfile using the above IAM Role.

  3. Create an EC2 Instance.
    Make it use the above InstanceProfile.

  4. Install the Docker Engine and the Gitlab runner on the EC2 Instance.

  5. Install the Amazon ECR Docker Credential Helper.

  6. Configure an AWS Region in /root/.aws/config:

    [default]
    region = eu-west-1
    
  7. Create the /root/.docker/config.json file and add the following line to it:

     {
       …
       "credsStore": "ecr-login"
     }
    
  8. Configure the runner to use the docker or docker+machine executor.

    [[runners]]
    executor = "docker"   # or "docker+machine"
    
  9. Configure the runner to use the ECR Credential Helper:

    [[runners]]
      environment = [ 'DOCKER_AUTH_CONFIG={"credsStore":"ecr-login"}' ]
    
  10. Configure jobs to use images saved in private AWS ECR registries:

    phpunit:
      stage: testing
      image:
        name: 123456789123.dkr.ecr.eu-west-1.amazonaws.com/php-gitlabrunner:latest
        entrypoint: [""]
      script:
        - php ./vendor/bin/phpunit --coverage-text --colors=never
    

The Gitlab runner should now automatically authenticate to one's private ECR registry.

Runners on Kubernetes

Store tokens in secrets instead of putting the token in the chart's values.

Requirements:

  • A running and configured Gitlab instance.
  • A Kubernetes cluster.

Procedure:

  1. [best practice] Create a dedicated namespace:

    kubectl create namespace 'gitlab'
    
  2. Create a runner in Gitlab:

    1. Go to one's Gitlab instance's /admin/runners page.
    2. Click on the New instance runner button.
    3. Keep Linux as runner type.
    4. Click on the Create runner button.
    5. Copy the runner's token.
  3. (Re-)Create the runner's Kubernetes secret with the runner's token from the previous step:

    kubectl delete --namespace 'gitlab' secret 'gitlab-runner-token' --ignore-not-found
    kubectl create --namespace 'gitlab' secret generic 'gitlab-runner-token' \
      --from-literal='runner-registration-token=""' --from-literal='runner-token=glrt-…'
    

    The secret's name must match the one referenced in the helm chart's values file.
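
    In the chart's values this corresponds to:

```yaml
runners:
  # Must match the name of the Kubernetes secret created above
  secret: gitlab-runner-token
```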

  4. Install the helm chart:

    helm --namespace 'gitlab' upgrade --install --repo 'https://charts.gitlab.io' \
      --values 'values.yaml' \
      'gitlab-runner' 'gitlab-runner'
    

    [best practice] Be sure to match the runner version with the Gitlab server's:

    helm search repo --versions 'gitlab/gitlab-runner'
    
Example helm chart values
gitlabUrl: https://gitlab.example.org/
unregisterRunners: true
concurrent: 20
checkInterval: 3
rbac:
  create: true
metrics:
  enabled: true
runners:
  config: |
    [[runners]]

      [runners.cache]
        Shared = true

      [runners.kubernetes]
        image = "alpine"
        pull_policy = [
          "if-not-present",
          "always"
        ]
        allowed_pull_policies = [
          "if-not-present",
          "always",
          "never"
        ]

        namespace = "{{.Release.Namespace}}"
  name: "runner-on-k8s"
  secret: gitlab-runner-token
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - ON_DEMAND
tolerations:
  - key: app
    operator: Equal
    value: gitlab
  - key: component
    operator: Equal
    value: runner
podLabels:
  team: engineering

Gotchas:

  • The build, helper and multiple service containers will all reside in a single pod.
    If the sum of the resources requested by all of them is too high, the pod will not be scheduled and the pipeline will hang and fail.
  • If any pod is killed due to OOM, the pipeline that spawned it will hang until it times out.
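
The first gotcha can be sketched in numbers (the figures below are illustrative, not from a real cluster):

```python
def pod_schedulable(node_allocatable_mib, container_requests_mib):
    """A job pod is only scheduled if the sum of the requests of *all*
    its containers fits into a single node's allocatable resources."""
    return sum(container_requests_mib) <= node_allocatable_mib

# build (2 GiB) + helper (128 MiB) + 2 service containers (1 GiB each)
# on a node with 4 GiB of allocatable memory: 4224 MiB > 4096 MiB
print(pod_schedulable(4096, [2048, 128, 1024, 1024]))  # False
```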

Improvements:

  • Keep the manager pod on stable nodes.

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
                - key: eks.amazonaws.com/capacityType
                  operator: In
                  values:
                    - ON_DEMAND
    
  • Dedicate specific nodes to runner executors.
    Taint dedicated nodes and add tolerations and affinities to the runner's configuration.

    [[runners]]
      [runners.kubernetes]
    
      [runners.kubernetes.node_selector]
        gitlab = "true"
        "kubernetes.io/arch" = "amd64"
    
        [runners.kubernetes.affinity]
          [runners.kubernetes.affinity.node_affinity]
            [runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution]
              [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms]]
                [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                  key = "app"
                  operator = "In"
                  values = [ "gitlab-runner" ]
                [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                  key = "customLabel"
                  operator = "In"
                  values = [ "customValue" ]
    
              [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution]]
                weight = 1
    
                [runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference]
                  [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference.match_expressions]]
                    key = "eks.amazonaws.com/capacityType"
                    operator = "In"
                    values = [ "ON_DEMAND" ]
    
        [runners.kubernetes.node_tolerations]
          "app=gitlab-runner" = "NoSchedule"
          "node-role.kubernetes.io/master" = "NoSchedule"
          "custom.toleration=value" = "NoSchedule"
          "empty.value=" = "PreferNoSchedule"
          onlyKey = ""
    
  • Avoid massive resource consumption by defaulting to (very?) strict resource limits and 0 requests.

    [[runners]]
      [runners.kubernetes]
        cpu_request = "0"
        cpu_limit = "2"
        memory_request = "0"
        memory_limit = "2Gi"
        ephemeral_storage_request = "0"
        ephemeral_storage_limit = "512Mi"
    
        helper_cpu_request = "0"
        helper_cpu_limit = "0.5"
        helper_memory_request = "0"
        helper_memory_limit = "128Mi"
        helper_ephemeral_storage_request = "0"
        helper_ephemeral_storage_limit = "64Mi"
    
        service_cpu_request = "0"
        service_cpu_limit = "1"
        service_memory_request = "0"
        service_memory_limit = "0.5Gi"
    
  • Play nice and leave some space for the host's other workloads by allowing resource request and limit overrides only up to a point.

    [[runners]]
      [runners.kubernetes]
        cpu_limit_overwrite_max_allowed = "15"
        cpu_request_overwrite_max_allowed = "15"
        memory_limit_overwrite_max_allowed = "62Gi"
        memory_request_overwrite_max_allowed = "62Gi"
        ephemeral_storage_limit_overwrite_max_allowed = "49Gi"
        ephemeral_storage_request_overwrite_max_allowed = "49Gi"
    
        helper_cpu_limit_overwrite_max_allowed = "0.9"
        helper_cpu_request_overwrite_max_allowed = "0.9"
        helper_memory_limit_overwrite_max_allowed = "1Gi"
        helper_memory_request_overwrite_max_allowed = "1Gi"
        helper_ephemeral_storage_limit_overwrite_max_allowed = "1Gi"
        helper_ephemeral_storage_request_overwrite_max_allowed = "1Gi"
    
        service_cpu_limit_overwrite_max_allowed = "3.9"
        service_cpu_request_overwrite_max_allowed = "3.9"
        service_memory_limit_overwrite_max_allowed = "15.5Gi"
        service_memory_request_overwrite_max_allowed = "15.5Gi"
        service_ephemeral_storage_limit_overwrite_max_allowed = "15Gi"
        service_ephemeral_storage_request_overwrite_max_allowed = "15Gi"
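
    Jobs then request overwrites (up to the maximums above) through the Kubernetes executor's CI/CD variables; the job name and values below are illustrative:

```yaml
heavy-build:
  variables:
    KUBERNETES_CPU_REQUEST: "4"
    KUBERNETES_CPU_LIMIT: "8"
    KUBERNETES_MEMORY_REQUEST: "8Gi"
    KUBERNETES_MEMORY_LIMIT: "16Gi"
  script:
    - make build
```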
    

Autoscaling

Docker Machine

A runner like any other, just configured to use the docker+machine executor.

Only cloud providers with a Docker Machine driver are supported.

Using this executor opens up specific configuration settings.

Pitfalls:

Example configuration
# Number of jobs *in total* that can be run concurrently by *all* configured runners
# Does *not* affect the *total* upper limit of VMs created by *all* providers
concurrent = 40

[[runners]]
  name = "static-scaler"

  url = "https://gitlab.example.org"
  token = "abcdefghijklmnopqrst"

  executor = "docker+machine"
  environment = [ "AWS_REGION=eu-west-1" ]

  # Number of jobs that can be run concurrently by the VMs created by *this* runner
  # Defines the *upper limit* of how many VMs can be created by *this* runner, since it is 1 task per VM at a time
  limit = 10

  [runners.machine]
    # Static number of VMs that need to be idle at all times
    IdleCount = 0

    # Remove VMs after 5m in the idle state
    IdleTime = 300

    # Maximum number of VMs that can be added to this runner in parallel
    # Defaults to 0 (no limit)
    MaxGrowthRate = 1

    # Template for the VMs' names
    # Must contain '%s'
    MachineName = "static-ondemand-%s"

    MachineDriver = "amazonec2"
    MachineOptions = [
      # Refer to the correct driver at 'https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/tree/main/docs/drivers'
      "amazonec2-region=eu-west-1",
      "amazonec2-vpc-id=vpc-1234abcd",
      "amazonec2-zone=a",                              # driver limitation, only 1 allowed
      "amazonec2-subnet-id=subnet-0123456789abcdef0",  # subnet-id in the specified az
      "amazonec2-use-private-address=true",
      "amazonec2-private-address-only=true",
      "amazonec2-security-group=GitlabRunners",

      "amazonec2-instance-type=m6i.large",
      "amazonec2-root-size=50",
      "amazonec2-iam-instance-profile=GitlabRunnerEc2",
      "amazonec2-tags=Team,Infrastructure,Application,Gitlab Runner,SpotInstance,False",
    ]

[[runners]]
  name = "dynamic-scaler"
  executor = "docker+machine"
  limit = 40  # will still respect the global concurrency value

  [runners.machine]
    # With 'IdleScaleFactor' defined, this becomes the upper limit of VMs that can be idle at all times
    IdleCount = 10

    # *Minimum* number of VMs that need to be idle at all times when 'IdleScaleFactor' is defined
    # Defaults to 1; will be set automatically to 1 if set lower than that
    IdleCountMin = 1

    # Number of VMs that need to be idle at all times, as a factor of the number of machines in use
    # In this case: idle VMs = 1.0 * machines in use, min 1, max 10
    # Must be a floating point number
    # Defaults to 0.0
    IdleScaleFactor = 1.0

    IdleTime = 600

    # Remove VMs after 250 jobs
    # Keeps them fresh
    MaxBuilds = 250

    MachineName = "dynamic-spot-%s"
    MachineDriver = "amazonec2"
    MachineOptions = [
      # Refer to the correct driver at 'https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/tree/main/docs/drivers'
      "amazonec2-region=eu-west-1",
      "amazonec2-vpc-id=vpc-1234abcd",
      "amazonec2-zone=b",                              # driver limitation, only 1 allowed
      "amazonec2-subnet-id=subnet-abcdef0123456789a",  # subnet-id in the specified az
      "amazonec2-use-private-address=true",
      "amazonec2-private-address-only=true",
      "amazonec2-security-group=GitlabRunners",

      "amazonec2-instance-type=r7a.large",
      "amazonec2-root-size=25",
      "amazonec2-iam-instance-profile=GitlabRunnerEc2",
      "amazonec2-tags=Team,Infrastructure,Application,Gitlab Runner,SpotInstance,True",

      "amazonec2-request-spot-instance=true",
      "amazonec2-spot-price=0.3",
    ]

    # Pump up the volume of available VMs during working hours
    [[runners.machine.autoscaling]]
      Periods = ["* * 9-17 * * mon-fri *"] # Every work day between 9 and 18 Amsterdam time
      Timezone = "Europe/Amsterdam"

      IdleCount = 20
      IdleCountMin = 5
      IdleTime = 3600

      # In this case: idle VMs = 1.5 * machines in use, min 5, max 20
      IdleScaleFactor = 1.5

    # Reduce the number of available VMs even further during the weekends
    [[runners.machine.autoscaling]]
      Periods = ["* * * * * sat,sun *"]
      IdleCount = 0
      IdleTime = 120
      Timezone = "UTC"
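
The interplay between IdleCount, IdleCountMin and IdleScaleFactor above can be sketched as follows (a simplified model of the documented behaviour, not the runner's actual code):

```python
def idle_vms(in_use, idle_count, idle_count_min=1, idle_scale_factor=0.0):
    """Number of VMs the runner tries to keep idle.

    With IdleScaleFactor unset (0.0), IdleCount is a static target.
    Otherwise IdleCount becomes the *upper limit* of
    IdleScaleFactor * machines-in-use, floored at IdleCountMin.
    """
    if idle_scale_factor <= 0.0:
        return idle_count
    desired = int(idle_scale_factor * in_use)
    return max(idle_count_min, min(desired, idle_count))

# 'static-scaler': IdleCount = 0, no scale factor -> no idle VMs kept
print(idle_vms(5, 0))           # 0
# 'dynamic-scaler': IdleCount = 10, IdleCountMin = 1, IdleScaleFactor = 1.0
print(idle_vms(0, 10, 1, 1.0))  # 1  (floored at IdleCountMin)
print(idle_vms(4, 10, 1, 1.0))  # 4
print(idle_vms(25, 10, 1, 1.0)) # 10 (capped at IdleCount)
```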

Further readings

Sources