Gitlab runner

  1. TL;DR
  2. Pull images from private AWS ECR registries
  3. Executors
    1. Docker Autoscaler executor
    2. Docker Machine executor
    3. Instance executor
  4. Autoscaling
    1. Docker Machine
    2. GitLab Runner Autoscaler
    3. Kubernetes
  5. Further readings
    1. Sources

TL;DR

Installation
brew install 'gitlab-runner'
dnf install 'gitlab-runner'
docker pull 'gitlab/gitlab-runner'
helm --namespace 'gitlab' upgrade --install --create-namespace --version '0.64.1' --repo 'https://charts.gitlab.io' \
  'gitlab-runner' -f 'values.gitlab-runner.yml' 'gitlab-runner'
Usage
docker run --rm --name 'runner' 'gitlab/gitlab-runner:alpine-v13.6.0' --version

# `gitlab-runner exec` is deprecated and has been removed in 17.0. ┌П┐(ಠ_ಠ) Gitlab.
# See https://docs.gitlab.com/16.11/runner/commands/#gitlab-runner-exec-deprecated.
gitlab-runner exec docker 'job-name'
gitlab-runner exec docker \
  --env 'AWS_ACCESS_KEY_ID=AKIA…' --env 'AWS_SECRET_ACCESS_KEY=F…s' --env 'AWS_REGION=eu-west-1' \
  --env 'DOCKER_AUTH_CONFIG={ "credsStore": "ecr-login" }' \
  --docker-volumes "$HOME/.aws/credentials:/root/.aws/credentials:ro" \
  'job-requiring-ecr-access'

Each runner executor is assigned 1 task at a time.

Runners seem to require the main instance to present the full certificate chain upon connection.
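A quick way to check this is counting the certificates the server presents during the TLS handshake; a leaf-only server yields 1, a complete chain at least 2. The `openssl` call needs a live endpoint and `gitlab.example.org` is a placeholder, so it is shown commented out here; the counting step runs against a mock handshake dump:

```shell
# Against a live instance (placeholder host):
#   openssl s_client -connect gitlab.example.org:443 -showcerts </dev/null 2>/dev/null > handshake.txt
# Count the presented certificates; 1 usually means the server sends only its
# leaf certificate, i.e. an incomplete chain. Mock dump with two certificates:
printf '%s\n' \
  '-----BEGIN CERTIFICATE-----' '…' '-----END CERTIFICATE-----' \
  '-----BEGIN CERTIFICATE-----' '…' '-----END CERTIFICATE-----' > handshake.txt
grep -c 'BEGIN CERTIFICATE' handshake.txt
```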

Pull images from private AWS ECR registries

  1. Create an IAM Role in one's AWS account and attach the arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly IAM policy to it.

  2. Create an InstanceProfile using the above IAM Role.

  3. Create an EC2 Instance.
    Make it use the above InstanceProfile.

  4. Install the Docker Engine and the Gitlab runner on the EC2 Instance.

  5. Install the Amazon ECR Docker Credential Helper.

  6. Configure an AWS Region in /root/.aws/config:

    [default]
    region = eu-west-1
    
  7. Create the /root/.docker/config.json file and add the following line to it:

     {
       …
    + "credsStore": "ecr-login"
     }
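When the file does not exist yet, it can be created wholesale instead of patched; a sketch using a scratch directory (on the instance the target is /root/.docker/config.json):

```shell
# Scratch directory stands in for /root/.docker on the instance.
cfg_dir="$(mktemp -d)"
cat > "$cfg_dir/config.json" <<'EOF'
{
  "credsStore": "ecr-login"
}
EOF
# Sanity check: the helper is configured
grep -c 'ecr-login' "$cfg_dir/config.json"
```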
    
  8. Configure the runner to use the docker or docker+machine executor.

    [[runners]]
    executor = "docker"   # or "docker+machine"
    
  9. Configure the runner to use the ECR Credential Helper:

    [[runners]]
    environment = [ 'DOCKER_AUTH_CONFIG={"credsStore":"ecr-login"}' ]
    
  10. Configure jobs to use images saved in private AWS ECR registries:

    phpunit:
      stage: testing
      image:
        name: 123456789123.dkr.ecr.eu-west-1.amazonaws.com/php-gitlabrunner:latest
        entrypoint: [""]
      script:
        - php ./vendor/bin/phpunit --coverage-text --colors=never
    

Now the GitLab runner should automatically authenticate to one's private ECR registry.
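Putting steps 8 and 9 together, the relevant part of config.toml looks roughly like this (a sketch; name, URL, token, and image are placeholders, and `environment` is a `[[runners]]`-level setting):

```toml
[[runners]]
  name = "ecr-enabled-runner"
  url = "https://gitlab.example.org"
  token = "<token>"
  executor = "docker"  # or "docker+machine"
  # Makes jobs authenticate to ECR via the credential helper
  environment = [ 'DOCKER_AUTH_CONFIG={"credsStore":"ecr-login"}' ]

  [runners.docker]
    image = "123456789123.dkr.ecr.eu-west-1.amazonaws.com/php-gitlabrunner:latest"
```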

Executors

Docker Autoscaler executor

Refer Docker Autoscaler executor.

Autoscale-enabled wrapper around the Docker executor that creates instances on-demand to accommodate jobs processed by the runner manager.

Leverages fleeting plugins to scale automatically.
Fleeting is an abstraction for a group of autoscaled instances, and uses plugins supporting cloud providers.

Add the following settings in the config.toml file:

[[runners]]
  executor = "docker-autoscaler"

  [runners.docker]
    image = "busybox:latest"  # or whatever

  [runners.autoscaler]
    plugin = "aws:latest"  # or 'googlecloud' or 'azure' or whatever

    [runners.autoscaler.plugin_config]
      name = "…"  # see plugin docs

    [[runners.autoscaler.policy]]
      idle_count = 5
      idle_time = "20m0s"
Example: AWS, 1 instance per job, 5 idle instances for 20min.

Give each job a dedicated instance.
As soon as the job completes, the instance is immediately deleted.

Try to keep 5 whole instances available for future demand.
Idle instances stay available for at least 20 minutes.

Requirements:

  • An EC2 instance with Docker Engine to act as manager.

  • A Launch Template referencing an AMI equipped with Docker Engine for the runners to use.

    Alternatively, any AMI that can run Docker Engine can be used, as long as an appropriate cloud-init configuration is provided in the template's userData.
    Specifically, the user executing Docker (by default, the instance's default user) must be part of the docker group to access Docker's socket.

    packages: [ "docker" ]
    runcmd:
      - systemctl daemon-reload
      - systemctl enable --now docker.service
      - grep -q docker /etc/group && usermod -a -G docker ec2-user
    
  • An AutoScaling Group with the following setting:

    • Minimum capacity = 0.
    • Desired capacity = 0.

    The runner will take care of scaling up and down.

  • An IAM Policy granting the manager instance the permissions needed to scale the ASG.
    Refer the Recommended IAM Policy.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowAsgDiscovering",
          "Effect": "Allow",
          "Action": [
            "autoscaling:DescribeAutoScalingGroups",
            "ec2:DescribeInstances"
          ],
          "Resource": "*"
        },
        {
          "Sid": "AllowAsgScaling",
          "Effect": "Allow",
          "Action": [
            "autoscaling:SetDesiredCapacity",
            "autoscaling:TerminateInstanceInAutoScalingGroup"
          ],
          "Resource": "arn:aws:autoscaling:eu-west-1:012345678901:autoScalingGroup:01234567-abcd-0123-abcd-0123456789ab:autoScalingGroupName/runners-autoscalingGroup"
        },
        {
          "Sid": "AllowManagingAccessToAsgInstances",
          "Effect": "Allow",
          "Action": "ec2-instance-connect:SendSSHPublicKey",
          "Resource": "arn:aws:ec2:eu-west-1:012345678901:instance/*",
          "Condition": {
            "StringEquals": {
              "ec2:ResourceTag/aws:autoscaling:groupName": "runners-autoscalingGroup"
            }
          }
        }
      ]
    }
    
  • [if needed] The Amazon ECR Docker Credential Helper installed on the manager instance.

  • [if needed] An IAM Policy granting the manager instance the permissions needed to pull images from ECRs.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowAuthenticatingWithEcr",
          "Effect": "Allow",
          "Action": "ecr:GetAuthorizationToken",
          "Resource": "*"
        },
        {
          "Sid": "AllowPullingImagesFromEcr",
          "Effect": "Allow",
          "Action": [
            "ecr:BatchGetImage",
            "ecr:GetDownloadUrlForLayer"
          ],
          "Resource": "arn:aws:ecr:eu-west-1:012345678901:repository/some-repo/busybox"
        }
      ]
    }
    

Procedure:

  1. Configure the default AWS Region for the AWS SDK to use (e.g. in /root/.aws/config):

    [default]
    region = eu-west-1
    
  2. Install the gitlab runner on the manager instance.
    Configure it to use the docker-autoscaler executor.

    concurrent = 10
    
    [[runners]]
      name = "docker autoscaler"
      url = "https://gitlab.example.org"
      token = "<token>"
      executor = "docker-autoscaler"
    
      [runners.docker]
        image = "012345678901.dkr.ecr.eu-west-1.amazonaws.com/some-repo/busybox:latest"
    
      [runners.autoscaler]
        plugin = "aws"
        max_instances = 10
    
        [runners.autoscaler.plugin_config]
          name = "my-docker-asg"  # the required ASG name
    
        [[runners.autoscaler.policy]]
          idle_count = 5
          idle_time = "20m0s"
    
  3. Install the fleeting plugin.

    gitlab-runner fleeting install
    

Docker Machine executor

Deprecated in GitLab 17.5.
If using this executor with EC2 instances, Azure Compute, or GCE, migrate to the GitLab Runner Autoscaler.

Supported cloud providers.

Using this executor opens up specific configuration settings.

Pitfalls:

Example configuration
# Number of jobs *in total* that can be run concurrently by *all* configured runners
# Does *not* affect the *total* upper limit of VMs created by *all* providers
concurrent = 40

[[runners]]
  name = "static-scaler"

  url = "https://gitlab.example.org"
  token = "abcdefghijklmnopqrst"

  executor = "docker+machine"
  environment = [ "AWS_REGION=eu-west-1" ]

  # Number of jobs that can be run concurrently by the VMs created by *this* runner
  # Defines the *upper limit* of how many VMs can be created by *this* runner, since it is 1 task per VM at a time
  limit = 10

  [runners.machine]
    # Static number of VMs that need to be idle at all times
    IdleCount = 0

    # Remove VMs after 5m in the idle state
    IdleTime = 300

    # Maximum number of VMs that can be added to this runner in parallel
    # Defaults to 0 (no limit)
    MaxGrowthRate = 1

    # Template for the VMs' names
    # Must contain '%s'
    MachineName = "static-ondemand-%s"

    MachineDriver = "amazonec2"
    MachineOptions = [
      # Refer the correct driver at 'https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/tree/main/docs/drivers'
      "amazonec2-region=eu-west-1",
      "amazonec2-vpc-id=vpc-1234abcd",
      "amazonec2-zone=a",                              # driver limitation, only 1 allowed
      "amazonec2-subnet-id=subnet-0123456789abcdef0",  # subnet-id in the specified az
      "amazonec2-use-private-address=true",
      "amazonec2-private-address-only=true",
      "amazonec2-security-group=GitlabRunners",

      "amazonec2-instance-type=m6i.large",
      "amazonec2-root-size=50",
      "amazonec2-iam-instance-profile=GitlabRunnerEc2",
      "amazonec2-tags=Team,Infrastructure,Application,Gitlab Runner,SpotInstance,False",
    ]

[[runners]]
  name = "dynamic-scaler"
  executor = "docker+machine"
  limit = 40  # will still respect the global concurrency value

  [runners.machine]
    # With 'IdleScaleFactor' defined, this becomes the upper limit of VMs that can be idle at all times
    IdleCount = 10

    # *Minimum* number of VMs that need to be idle at all times when 'IdleScaleFactor' is defined
    # Defaults to 1; will be set automatically to 1 if set lower than that
    IdleCountMin = 1

    # Number of VMs that need to be idle at all times, as a factor of the number of machines in use
    # In this case: idle VMs = 1.0 * machines in use, min 1, max 10
    # Must be a floating point number
    # Defaults to 0.0
    IdleScaleFactor = 1.0

    IdleTime = 600

    # Remove VMs after 250 jobs
    # Keeps them fresh
    MaxBuilds = 250

    MachineName = "dynamic-spot-%s"
    MachineDriver = "amazonec2"
    MachineOptions = [
      # Refer the correct driver at 'https://gitlab.com/gitlab-org/ci-cd/docker-machine/-/tree/main/docs/drivers'
      "amazonec2-region=eu-west-1",
      "amazonec2-vpc-id=vpc-1234abcd",
      "amazonec2-zone=b",                              # driver limitation, only 1 allowed
      "amazonec2-subnet-id=subnet-abcdef0123456789a",  # subnet-id in the specified az
      "amazonec2-use-private-address=true",
      "amazonec2-private-address-only=true",
      "amazonec2-security-group=GitlabRunners",

      "amazonec2-instance-type=r7a.large",
      "amazonec2-root-size=25",
      "amazonec2-iam-instance-profile=GitlabRunnerEc2",
      "amazonec2-tags=Team,Infrastructure,Application,Gitlab Runner,SpotInstance,True",

      "amazonec2-request-spot-instance=true",
      "amazonec2-spot-price=0.3",
    ]

    # Pump up the volume of available VMs during working hours
    [[runners.machine.autoscaling]]
      Periods = ["* * 9-17 * * mon-fri *"] # Every work day between 9 and 18 Amsterdam time
      Timezone = "Europe/Amsterdam"

      IdleCount = 20
      IdleCountMin = 5
      IdleTime = 3600

      # In this case: idle VMs = 1.5 * machines in use, min 5, max 20
      IdleScaleFactor = 1.5

    # Further reduce the number of available VMs during weekends
    [[runners.machine.autoscaling]]
      Periods = ["* * * * * sat,sun *"]
      IdleCount = 0
      IdleTime = 120
      Timezone = "UTC"
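The `IdleScaleFactor` arithmetic used above can be sketched as follows. This is an illustration of the documented clamp between `IdleCountMin` and `IdleCount`, not GitLab's actual implementation; 4 machines in use and the working-hours values are assumed:

```shell
# factor=1.5, min=5 (IdleCountMin), max=20 (IdleCount); 4 machines in use
awk -v in_use=4 -v factor=1.5 -v min=5 -v max=20 'BEGIN {
  t = in_use * factor            # 4 * 1.5 = 6
  if (t < min) t = min           # never fewer than IdleCountMin idle VMs
  if (t > max) t = max           # never more than IdleCount idle VMs
  printf "%d\n", t
}'
```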

Instance executor

Refer Instance executor.

Autoscale-enabled executor that creates instances on-demand to accommodate the expected volume of jobs processed by the runner manager.

Useful when jobs need full access to the host instance, operating system, and attached devices.
Can be configured to accommodate single and multi-tenant jobs with various levels of isolation and security.

Autoscaling

Refer GitLab Runner Autoscaling.

GitLab Runner can automatically scale using public cloud instances when configured to use an autoscaler.

Autoscaling options are available for public cloud instances and the following orchestration solutions:

  • OpenShift.
  • Kubernetes.
  • Amazon ECS clusters using Fargate.

Docker Machine

Refer Autoscaling GitLab Runner on AWS EC2.

One or more runners must act as managers, and be configured to use the Docker Machine executor.
Managers interact with the cloud infrastructure to create multiple runner instances to execute jobs.
Cloud instances acting as managers shall not be spot instances.

GitLab Runner Autoscaler

Refer GitLab Runner Autoscaler.

Successor to the Docker Machine executor.

Composed of:

  • Taskscaler: manages autoscaling logic, bookkeeping, and fleet creation.
  • Fleeting: abstraction for cloud-provided virtual machines.
  • Cloud provider plugin: handles the API calls to the target cloud platform.

One or more runners must act as managers.
Managers interact with the cloud infrastructure to create multiple runner instances to execute jobs.
Cloud instances acting as managers shall not be spot instances.

Managers must be configured to use one of the executors supporting autoscaling, i.e. the Docker Autoscaler or Instance executor.

Kubernetes

Store tokens in secrets instead of putting the token in the chart's values.

Requirements:

  • A running and configured Gitlab instance.
  • A running Kubernetes cluster.
Installation procedure
  1. [best practice] Create a dedicated namespace:

    kubectl create namespace 'gitlab'
    
  2. Create a runner in gitlab:

    Web UI
    1. Go to one's Gitlab instance's /admin/runners page.
    2. Click on the New instance runner button.
    3. Keep Linux as runner type.
    4. Click on the Create runner button.
    5. Copy the runner's token.
    API
    curl -X 'POST' 'https://gitlab.example.org/api/v4/user/runners' -H 'PRIVATE-TOKEN: glpat-m-…' \
      -d 'runner_type=instance_type' -d 'tag_list=small,instance' -d 'run_untagged=false' -d 'description=a runner'
    
  3. (Re-)Create the runners' Kubernetes secret with the runners' token from the previous step:

    kubectl --namespace 'gitlab' delete secret 'gitlab-runner-token' --ignore-not-found
    kubectl --namespace 'gitlab' create secret generic 'gitlab-runner-token' \
      --from-literal='runner-registration-token=' --from-literal='runner-token=glrt-…'
    
  4. [best practice] Be sure to match the runner version with the Gitlab server's:

    helm search repo --versions 'gitlab/gitlab-runner'
    
  5. Install the helm chart.

    The secret's name must be matched in the helm chart's values file.

    helm --namespace 'gitlab' upgrade --install 'gitlab-runner-manager' \
      --repo 'https://charts.gitlab.io' 'gitlab-runner' --version '0.69.0' \
      --values 'values.yaml' --set 'runners.secret=gitlab-runner-token'
    
Example helm chart values
gitlabUrl: https://gitlab.example.org/
unregisterRunners: true
concurrent: 20
checkInterval: 3
rbac:
  create: true
metrics:
  enabled: true
runners:
  name: "runner-on-k8s"
  secret: gitlab-runner-token
  config: |
    [[runners]]

      [runners.cache]
        Shared = true

      [runners.kubernetes]
        namespace = "{{.Release.Namespace}}"
        image = "alpine"
        pull_policy = [
          "if-not-present",
          "always"
        ]
        allowed_pull_policies = [
          "if-not-present",
          "always",
          "never"
        ]

        [runners.kubernetes.affinity]
          [runners.kubernetes.affinity.node_affinity]
            [runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution]
              [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms]]
                [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                  key = "org.example.reservation/app"
                  operator = "In"
                  values = [ "gitlab" ]
                [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                  key = "org.example.reservation/component"
                  operator = "In"
                  values = [ "runner" ]
            [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution]]
              weight = 1
              [runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference]
                [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference.match_expressions]]
                  key = "eks.amazonaws.com/capacityType"
                  operator = "In"
                  values = [ "ON_DEMAND" ]
        [runners.kubernetes.node_tolerations]
          "reservation/app=gitlab" = "NoSchedule"
          "reservation/component=runner" = "NoSchedule"

affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
            - key: eks.amazonaws.com/capacityType
              operator: In
              values:
                - ON_DEMAND
tolerations:
  - key: app
    operator: Equal
    value: gitlab
  - key: component
    operator: Equal
    value: runner
podLabels:
  team: engineering

Gotchas:

  • The build, helper and multiple service containers will all reside in a single pod.
    If the sum of the resources requested by all of them is too high, the pod will not be scheduled and the pipeline will hang and fail.
  • If any pod is killed due to OOM, the pipeline that spawned it will hang until it times out.

Improvements:

  • Keep the manager pod on stable nodes.

    affinity:
      nodeAffinity:
        preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 1
            preference:
              matchExpressions:
                - key: eks.amazonaws.com/capacityType
                  operator: In
                  values:
                    - ON_DEMAND
    
  • Dedicate specific nodes to runner executors.
    Taint dedicated nodes and add tolerations and affinities to the runner's configuration.

    [[runners]]
      [runners.kubernetes]
    
      [runners.kubernetes.node_selector]
        gitlab = "true"
        "kubernetes.io/arch" = "amd64"
    
        [runners.kubernetes.affinity]
          [runners.kubernetes.affinity.node_affinity]
            [runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution]
              [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms]]
                [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                  key = "app"
                  operator = "In"
                  values = [ "gitlab-runner" ]
                [[runners.kubernetes.affinity.node_affinity.required_during_scheduling_ignored_during_execution.node_selector_terms.match_expressions]]
                  key = "customLabel"
                  operator = "In"
                  values = [ "customValue" ]
    
              [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution]]
                weight = 1
    
                [runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference]
                  [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.preference.match_expressions]]
                    key = "eks.amazonaws.com/capacityType"
                    operator = "In"
                    values = [ "ON_DEMAND" ]
    
        [runners.kubernetes.node_tolerations]
          "app=gitlab-runner" = "NoSchedule"
          "node-role.kubernetes.io/master" = "NoSchedule"
          "custom.toleration=value" = "NoSchedule"
          "empty.value=" = "PreferNoSchedule"
          onlyKey = ""
    
  • Avoid massive resource consumption by defaulting to (very?) strict resource limits and 0 request.

    [[runners]]
      [runners.kubernetes]
        cpu_request = "0"
        cpu_limit = "2"
        memory_request = "0"
        memory_limit = "2Gi"
        ephemeral_storage_request = "0"
        ephemeral_storage_limit = "512Mi"
    
        helper_cpu_request = "0"
        helper_cpu_limit = "0.5"
        helper_memory_request = "0"
        helper_memory_limit = "128Mi"
        helper_ephemeral_storage_request = "0"
        helper_ephemeral_storage_limit = "64Mi"
    
        service_cpu_request = "0"
        service_cpu_limit = "1"
        service_memory_request = "0"
        service_memory_limit = "0.5Gi"
    
  • Play nice and leave some room for the host's other workloads by allowing resource request and limit overrides only up to a point.

    [[runners]]
      [runners.kubernetes]
        cpu_limit_overwrite_max_allowed = "15"
        cpu_request_overwrite_max_allowed = "15"
        memory_limit_overwrite_max_allowed = "62Gi"
        memory_request_overwrite_max_allowed = "62Gi"
        ephemeral_storage_limit_overwrite_max_allowed = "49Gi"
        ephemeral_storage_request_overwrite_max_allowed = "49Gi"
    
        helper_cpu_limit_overwrite_max_allowed = "0.9"
        helper_cpu_request_overwrite_max_allowed = "0.9"
        helper_memory_limit_overwrite_max_allowed = "1Gi"
        helper_memory_request_overwrite_max_allowed = "1Gi"
        helper_ephemeral_storage_limit_overwrite_max_allowed = "1Gi"
        helper_ephemeral_storage_request_overwrite_max_allowed = "1Gi"
    
        service_cpu_limit_overwrite_max_allowed = "3.9"
        service_cpu_request_overwrite_max_allowed = "3.9"
        service_memory_limit_overwrite_max_allowed = "15.5Gi"
        service_memory_request_overwrite_max_allowed = "15.5Gi"
        service_ephemeral_storage_limit_overwrite_max_allowed = "15Gi"
        service_ephemeral_storage_request_overwrite_max_allowed = "15Gi"
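The caps above only come into play when jobs actually request overrides, which is done through GitLab's `KUBERNETES_*` CI/CD variables; a sketch (job name and values are hypothetical, and must stay within the configured maximums):

```yaml
integration-tests:
  variables:
    KUBERNETES_CPU_REQUEST: "4"
    KUBERNETES_CPU_LIMIT: "8"
    KUBERNETES_MEMORY_REQUEST: "8Gi"
    KUBERNETES_MEMORY_LIMIT: "16Gi"
  script:
    - make test
```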
    

Further readings

Sources