AWS Systems Manager

  1. TL;DR
  2. Requirements
  3. Gotchas
  4. Integrate with Ansible
  5. Troubleshooting
    1. Check node availability using ssm-cli
  6. Further readings
    1. Sources

TL;DR

Usage
# Get connection statuses.
aws ssm get-connection-status --target 'instance-id'

# Start sessions.
aws ssm start-session --target 'instance-id'

# Run commands.
aws ssm start-session \
  --target 'instance-id' \
  --document-name 'CustomCommandSessionDocument' \
  --parameters '{"logpath":["/var/log/amazon/ssm/amazon-ssm-agent.log"]}'
aws ssm send-command \
  --instance-ids 'i-0123456789abcdef0' \
  --document-name 'AWS-RunShellScript' \
  --parameters "commands="echo 'hallo'"

# Wait for commands execution.
aws ssm wait command-executed --instance-id 'i-0123456789abcdef0' --command-id 'abcdef01-2345-abcd-6789-abcdef012345'

# Get commands results.
aws ssm get-command-invocation --instance-id 'i-0123456789abcdef0' --command-id 'abcdef01-2345-abcd-6789-abcdef012345'
aws ssm get-command-invocation \
  --instance-id 'i-0123456789abcdef0' --command-id 'abcdef01-2345-abcd-6789-abcdef012345' \
  --query '{"status": Status, "rc": ResponseCode, "stdout": StandardOutputContent, "stderr": StandardErrorContent}'
Real world use cases

Also check out the snippets.

# Connect to instances if they are available.
instance_id='i-08fc83ad07487d72f' \
&& [ "$(aws ssm get-connection-status --target "$instance_id" --query 'Status' --output 'text')" = 'connected' ] \
&& aws ssm start-session --target "$instance_id" \
|| (echo "instance ${instance_id} not available" >&2 && false)

# Run commands and get their output.
instance_id='i-0915612f182914822' \
&& command_id=$(aws ssm send-command --instance-ids "$instance_id" \
  --document-name 'AWS-RunShellScript' --parameters 'commands="echo hallo"' \
  --query 'Command.CommandId' --output 'text') \
&& aws ssm wait command-executed --command-id "$command_id" --instance-id "$instance_id" \
&& aws ssm get-command-invocation --command-id "$command_id" --instance-id "$instance_id" \
  --query '{"status": Status, "rc": ResponseCode, "stdout": StandardOutputContent, "stderr": StandardErrorContent}'

Requirements

For an instance to be managed by Systems Manager and appear in the lists of managed nodes, it must:

  • Run a supported operating system.

  • Have the SSM Agent installed and running.

    sudo dnf -y install 'amazon-ssm-agent'
    sudo systemctl enable --now 'amazon-ssm-agent.service'
    
  • Have an AWS IAM instance profile with the correct permissions attached.
    The instance profile enables the instance to communicate with the Systems Manager service. Alternatively, the instance must be registered with Systems Manager using a hybrid activation.

    The minimum required permissions are granted by the Amazon-managed AmazonSSMManagedInstanceCore policy (arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore). A minimal CLI sketch follows this list.

  • Be able to connect to a Systems Manager endpoint through the SSM Agent in order to register with the service.
    From there, the instance must remain available to the service, which confirms this by sending a signal every five minutes to check the instance's health. For instances without outbound Internet access, see the VPC endpoint sketch after this list.

    After the status of a managed node has been Connection Lost for at least 30 days, the node could be removed from the Fleet Manager console.
    To restore it to the list, resolve the issues that caused the lost connection.
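
The following is a minimal CLI sketch for the instance profile requirement above. The role, profile, and instance names/IDs are hypothetical; adapt them, or attach the policy to an existing role instead.

# Create a role EC2 instances can assume, and grant it the minimum Systems Manager permissions.
aws iam create-role --role-name 'ssm-managed-instance' \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ec2.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name 'ssm-managed-instance' \
  --policy-arn 'arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore'

# Wrap the role in an instance profile and attach it to the instance.
aws iam create-instance-profile --instance-profile-name 'ssm-managed-instance'
aws iam add-role-to-instance-profile --instance-profile-name 'ssm-managed-instance' --role-name 'ssm-managed-instance'
aws ec2 associate-iam-instance-profile --instance-id 'i-0123456789abcdef0' \
  --iam-instance-profile Name='ssm-managed-instance'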

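Should instances have no outbound route to the public Systems Manager endpoints (e.g. private subnets without a NAT gateway), interface VPC endpoints can provide the required connectivity. A sketch with hypothetical VPC, subnet and security group IDs:

region='eu-west-1'
for service in 'ssm' 'ec2messages' 'ssmmessages'; do
  aws ec2 create-vpc-endpoint \
    --vpc-endpoint-type 'Interface' \
    --vpc-id 'vpc-0123456789abcdef0' \
    --service-name "com.amazonaws.${region}.${service}" \
    --subnet-ids 'subnet-0123456789abcdef0' \
    --security-group-ids 'sg-0123456789abcdef0' \
    --private-dns-enabled
done
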
Check whether SSM Agent successfully registered with the Systems Manager service by executing the aws ssm describe-instance-associations-status command.
It won't return results until a successful registration has taken place.

aws ssm describe-instance-associations-status --instance-id 'instance-id'
Failed invocation
{
  "InstanceAssociationStatusInfos": []
}
Successful invocation
{
  "InstanceAssociationStatusInfos": [
    {
      "AssociationId": "51f0ed7e-c236-4c34-829d-e8f2a7a3bb4a",
      "Name": "AWS-GatherSoftwareInventory",
      "DocumentVersion": "1",
      "AssociationVersion": "2",
      "InstanceId": "i-0123456789abcdef0",
      "ExecutionDate": "2024-04-22T14:41:37.313000+02:00",
      "Status": "Success",
      "ExecutionSummary": "1 out of 1 plugin processed, 1 success, 0 failed, 0 timedout, 0 skipped. ",
      "AssociationName": "InspectorInventoryCollection-do-not-delete"
    },
    …
  ]
}
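
Registered managed nodes also show up in aws ssm describe-instance-information along with their ping status; instances that never registered simply do not appear. A quick complementary check:

aws ssm describe-instance-information \
  --filters 'Key=InstanceIds,Values=i-0123456789abcdef0' \
  --query 'InstanceInformationList[].{id: InstanceId, ping: PingStatus, agent: AgentVersion}'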

Gotchas

  • SSM starts shell sessions under /usr/bin (source):

    Other shell profile configuration options
    By default, Session Manager starts in the "/usr/bin" directory.

    To change the starting directory, see the Session Manager preferences sketch after this list.

  • Avoid launching SSM sessions through commands like xargs or parallel, as in the following:

    aws ec2 describe-instances --output text --query 'Reservations[].Instances[0].InstanceId' --filters … \
    | xargs -ot aws ssm start-session --target
    

    The intermediate command starts the session correctly, but it also intercepts signals like CTRL-C, stopping its own execution and terminating the SSM session with it.

    Prefer using the describe-instances command's output as input for the start-session command instead:

    aws ssm start-session --target "$( \
      aws ec2 describe-instances --output text --query 'Reservations[].Instances[0].InstanceId' --filters … \
    )"
    

Integrate with Ansible

Create a dynamic inventory whose name ends with aws_ec2.yml (e.g. test.aws_ec2.yml or simply aws_ec2.yml).
It needs to be named like that to be picked up by the 'amazon.aws.aws_ec2' inventory plugin.

Refer to the amazon.aws.aws_ec2 inventory plugin's documentation for more information about the file's specification.

Important

Even though this is a YAML file, it must not start with '---'.
Ansible will fail to parse it in that case.

plugin: amazon.aws.aws_ec2
region: eu-north-1
include_filters:
  - # exclude instances that are not running, which are inoperable
    instance-state-name: running
exclude_filters:
  - tag-key:
      - aws:eks:cluster-name  # skip EKS nodes, since they are managed in their own way
  - # skip GitLab Runners, since they are volatile and managed in their own way
    tag:Application:
      - GitLab
    tag:Component:
      - Runner
use_ssm_inventory:
  # requires 'ssm:GetInventory' permissions on 'arn:aws:ssm:<region>:<account-id>:*'
  # this makes the sync fail miserably if configured on AWX inventories
  true
hostnames:
  - instance-id
    # acts as keyword to use the instances' 'InstanceId' attribute
    # use 'private-ip-address' to use the instances' 'PrivateIpAddress' attribute instead
    # or any option in <https://docs.aws.amazon.com/cli/latest/reference/ec2/describe-instances.html#options> really
keyed_groups:
  # add hosts to '<prefix>_<value>' groups for each aws_ec2 host's matching attribute
  # e.g.: 'arch_x86_64', 'os_Name_Amazon_Linux', 'tag_Name_GitLab_Server'
  - key: architecture
    prefix: arch
  - key: ssm_inventory.platform_name
    prefix: os_Name
  - key: ssm_inventory.platform_type
    prefix: os_Type
  - key: ssm_inventory.platform_version
    prefix: os_Version
  # - key: tags  # would create a group per each tag value; prefer limiting groups to the useful ones
  #   prefix: tag
  - key: tags.Team
    prefix: tag_Team
  - key: tags.Environment
    prefix: tag_Environment
  - key: tags.Application
    prefix: tag_Application
  - key: tags.Component
    prefix: tag_Component
  - key: tags.Name
    prefix: tag_Name
compose:
  # add extra host variables
  # use non-jinja values (e.g. strings) by wrapping them in two sets of quotes
  # if using awx, prefer keeping double quotes external (e.g. "'something'") as it just looks better in the ui
  ansible_connection: "'aws_ssm'"
  ansible_aws_ssm_region: "'eu-north-1'"
  ansible_aws_ssm_timeout: "'300'"
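
A quick smoke test of the inventory and connection from the control node might look like the sketch below. It assumes the amazon.aws and community.aws collections and AWS' session-manager-plugin are installed locally, and that an existing S3 bucket is available for the connection plugin's file transfers (its ansible_aws_ssm_bucket_name option; the bucket name below is hypothetical).

# Install the collections used by the inventory and the connection plugin.
ansible-galaxy collection install 'amazon.aws' 'community.aws'

# Verify the inventory resolves and the keyed groups are populated.
ansible-inventory -i 'test.aws_ec2.yml' --graph

# Verify connectivity over SSM.
ansible -i 'test.aws_ec2.yml' 'all' -m 'ansible.builtin.ping' \
  -e 'ansible_aws_ssm_bucket_name=some-existing-bucket'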

Pitfalls:

  • One shall not use the remote_user connection option, as it is not supported by the plugin.
    From the plugin notes:

    The community.aws.aws_ssm connection plugin does not support using the remote_user and ansible_user variables to configure the remote user. The become_user parameter should be used to configure which user to run commands as. Remote commands will often default to running as the ssm-agent user, however this will also depend on how SSM has been configured.

  • SSM sessions' duration is limited by SSM's idle session timeout setting.
    That might impact tasks that need to run for longer than said duration.

    Some modules (e.g.: community.postgresql.postgresql_db) got their session terminated, after which SSM retried the task, killing and restarting the running process.
    Since the process lasted longer than the session's duration, it kept having its sessions terminated. The task eventually failed when SSM reached the configured number of connection retries.

    Consider extending the SSM idle session timeout setting (see the preferences sketch after this list), or using async tasks to circumvent this issue.
    Mind that async tasks come with their own SSM caveats.

  • Since SSM starts shell sessions under /usr/bin, one must explicitly set Ansible's temporary directory to a folder the remote user can write to (source).

    ANSIBLE_REMOTE_TMP="/tmp/.ansible/tmp" ansible…
    
    # file: ansible.cfg
    [defaults]
    remote_tmp=/tmp/.ansible/tmp
    
     - hosts: all
    +  vars:
    +    ansible_remote_tmp: /tmp/.ansible/tmp
       tasks: …
    

    This, or use the shell profiles in SSM's Session Manager preferences to change directory upon login.

  • In similar fashion to the point above, SSM might mess up the directory used by async tasks.
    To avoid this, set it to a folder the remote user can write to.

    ANSIBLE_ASYNC_DIR="/tmp/.ansible-${USER}/async" ansible…
    
    # file: ansible.cfg
    [defaults]
    async_dir=/tmp/.ansible-${USER}/async
    
     - hosts: all
    +  vars:
    +    ansible_async_dir: /tmp/.ansible/async
       tasks: …
    
  • Depending on how SSM is configured, it can scramble or pollute the output of some Ansible modules.
    When this happens, the module might report a failure even though the process does run. This happens especially frequently when using async tasks.

    Task:

    - name: Download S3 object
      ansible.builtin.command:
        cmd: >-
          aws s3 cp 's3://some-bucket/some.object' '{{ ansible_user_dir }}/some.object'
        creates: "{{ ansible_user_dir }}/some.object"
      async: 900
      poll: 0  # fire and forget, since ssm would not allow self-checking anyways
      register: s3_object_download
    - name: Check on the S3 object download task
      when:
        - s3_object_download is not skipped
        - s3_object_download is not failed
      vars:
        ansible_aws_ssm_timeout: 900  # keep the connection active the whole time
      ansible.builtin.async_status:
        jid: "{{ s3_object_download.ansible_job_id }}"
      register: s3_object_download_result
      until: s3_object_download_result.finished
      retries: 5
      delay: 60
    

    Task output:

    {
      "module_stdout": "\u001b]0;@ip-172-31-33-33:/usr/bin\u0007{\"failed\": 0, \"started\": 1, \"finished\": 0, \"ansible_job_id\": \"j924541890996.43612\", \"results_file\": \"/tmp/.ansible/async/j924541890996.43612\", \"_ansible_suppress_tmpdir_delete\": true}\r\r",
      "module_stderr": "",
      "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
      "rc": 0,
      "changed": true,
      "_ansible_no_log": false,
      "failed_when_result": false
    }
    

    To get around this, one will need to set custom conditions for changed and failed states by parsing the polluted output and checking the specific values required.

     - name: Download S3 object
       ansible.builtin.command:
         cmd: >-
           aws s3 cp 's3://some-bucket/some.object' '{{ ansible_user_dir }}/some.object'
         creates: "{{ ansible_user_dir }}/some.object"
       async: 900
       poll: 0  # fire and forget, since ssm would not allow self-checking anyways
       register: s3_object_download
    +  changed_when:
    +    - "'started' | extract(s3_object_download.module_stdout | regex_search('{.*}') | from_json) == 1"
    +    - "'failed'  | extract(s3_object_download.module_stdout | regex_search('{.*}') | from_json) == 0"
    +  failed_when: "'failed' | extract(s3_object_download.module_stdout | regex_search('{.*}') | from_json) == 1"
     - name: Check on the S3 object download task
       when:
         - s3_object_download is not skipped
         - s3_object_download is not failed
       vars:
         ansible_aws_ssm_timeout: 900  # keep the connection active the whole time
    +    s3_object_download_stdout_as_obj: >-
    +      {{ s3_object_download.module_stdout | regex_search('{.*}') | from_json }}
       ansible.builtin.async_status:
    -    jid: "{{ s3_object_download.ansible_job_id }}"
    +    jid: "{{ s3_object_download_stdout_as_obj.ansible_job_id }}"
       register: s3_object_download_result
       until: s3_object_download_result.finished
       retries: 5
       delay: 60
    
  • When using async tasks, SSM fires up the task and disconnects.
    This can make the task report a failure at some point; even so, the process keeps running on the target host.

    {
      "failed": 0,
      "started": 1,
      "finished": 0,
      "ansible_job_id": "j604343782826.4885",
      "results_file": "/tmp/.ansible/async/j604343782826.4885",
      "_ansible_suppress_tmpdir_delete": true
    }
    

    Fire these tasks with poll set to 0 and force a specific failure condition yourself.
    Then, use a separate task to check up on them.

    Important

    When checking up on tasks with ansible.builtin.async_status, SSM will use a single connection.
    Consider keeping said connection alive until the end of the task.

    FIXME: check. This seems to not be needed anymore.

    Example
    - name: Dump a PostgreSQL DB from an RDS instance
      hosts: all
      vars:
        ansible_connection: amazon.aws.aws_ssm
        ansible_remote_tmp: /tmp/.ansible/tmp             #-- see pitfalls (ssm starts sessions in '/usr/bin')
        ansible_async_dir: /tmp/.ansible/async            #-- see pitfalls (ssm starts sessions in '/usr/bin')
        pg_dump_max_wait_in_seconds: "{{ 60 * 60 * 2 }}"  #-- wait up to 2 hours (60s * 60m * 2h)
        pg_dump_check_delay_in_seconds: 60                #-- avoid overloading the ssm agent with sessions
        pg_dump_check_retries:                            #-- max_wait/delay
          "{{ (pg_dump_max_wait_in_seconds | int) / (pg_dump_check_delay_in_seconds | int) }}"
      tasks:
        - name: Dump the DB from the RDS instance
          community.postgresql.postgresql_db: { … }
          async: "{{ pg_dump_max_wait_in_seconds | int }}"
          poll: 0                           #-- fire and forget; ssm would not allow self-checking anyways
          register: pg_dump_task_execution  #-- expected: { failed: 0, started: 1, finished: 0, ansible_job_id: … }
          changed_when:
            - pg_dump_task_execution.started == 1
            - pg_dump_task_execution.failed  == 0
          failed_when: pg_dump_task_execution.failed  == 1  #-- specify the failure yourself
        - name: Check on the PG dump task
          vars:
            ansible_aws_ssm_timeout: "{{ pg_dump_max_wait_in_seconds }}"  #-- keep the connection active the whole time
          ansible.builtin.async_status:
            jid: "{{ pg_dump_task_execution.ansible_job_id }}"
          register: pg_dump_task_execution_result
          until: pg_dump_task_execution_result.finished
          retries: "{{ pg_dump_check_retries | int }}"          #-- mind the argument's type
          delay: "{{ pg_dump_check_delay_in_seconds | int }}"   #-- mind the argument's type
    
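As mentioned in the idle session timeout pitfall above, that limit lives in the Session Manager preferences, in the same SSM-SessionManagerRunShell document used in the gotchas sketch. A minimal sketch to raise it to its documented maximum of 60 minutes, again merging with the existing preferences content rather than replacing it:

aws ssm update-document --name 'SSM-SessionManagerRunShell' --document-version '$LATEST' --content '{
  "schemaVersion": "1.0",
  "description": "Document to hold regional settings for Session Manager",
  "sessionType": "Standard_Stream",
  "inputs": { "idleSessionTimeout": "60" }
}'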

Troubleshooting

Refer to Troubleshooting managed node availability.

  1. Check the Requirements are satisfied.
  2. Check node availability using ssm-cli.

Check node availability using ssm-cli

Refer to Troubleshooting managed node availability using ssm-cli.

From the managed instance:

$ sudo dnf -y install 'amazon-ssm-agent'
$ sudo systemctl enable --now 'amazon-ssm-agent.service'
$ sudo ssm-cli get-diagnostics --output 'table'
┌──────────────────────────────────────┬─────────┬─────────────────────────────────────────────────────────────────────┐
│ Check                                │ Status  │ Note                                                                │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ EC2 IMDS                             │ Success │ IMDS is accessible and has instance id i-0123456789abcdef0 in       │
│                                      │         │ region eu-west-1                                                    │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Hybrid instance registration         │ Skipped │ Instance does not have hybrid registration                          │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Connectivity to ssm endpoint         │ Success │ ssm.eu-west-1.amazonaws.com is reachable                            │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Connectivity to ec2messages endpoint │ Success │ ec2messages.eu-west-1.amazonaws.com is reachable                    │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Connectivity to ssmmessages endpoint │ Success │ ssmmessages.eu-west-1.amazonaws.com is reachable                    │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Connectivity to s3 endpoint          │ Success │ s3.eu-west-1.amazonaws.com is reachable                             │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Connectivity to kms endpoint         │ Success │ kms.eu-west-1.amazonaws.com is reachable                            │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Connectivity to logs endpoint        │ Success │ logs.eu-west-1.amazonaws.com is reachable                           │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Connectivity to monitoring endpoint  │ Success │ monitoring.eu-west-1.amazonaws.com is reachable                     │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ AWS Credentials                      │ Success │ Credentials are for                                                 │
│                                      │         │ arn:aws:sts::012345678901:assumed-role/managed/i-0123456789abcdef0  │
│                                      │         │ and will expire at 2024-04-22 18:19:48 +0000 UTC                    │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Agent service                        │ Success │ Agent service is running and is running as expected user            │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ Proxy configuration                  │ Skipped │ No proxy configuration detected                                     │
├──────────────────────────────────────┼─────────┼─────────────────────────────────────────────────────────────────────┤
│ SSM Agent version                    │ Success │ SSM Agent version is 3.3.131.0 which is the latest version          │
└──────────────────────────────────────┴─────────┴─────────────────────────────────────────────────────────────────────┘

Further readings

Sources