Amazon OpenSearch Service

Amazon's offering for managed OpenSearch clusters.

  1. Storage
    1. UltraWarm storage
    2. Cold storage
  2. Operations
    1. Migrate indices to UltraWarm storage
    2. Return warm indices to hot storage
    3. Migrate indices to Cold storage
  3. Index state management plugin
  4. Snapshots
  5. Best practices
    1. Dedicated master nodes
  6. Cost-saving measures
  7. Further readings
    1. Sources

Storage

Clusters can be set up to use the hot-warm architecture.
Compared to the plain OpenSearch product, AWS's managed OpenSearch service offers two extra storage options: UltraWarm and Cold.

Hot storage provides the fastest possible performance for indexing and searching new data.
Data nodes use hot storage in the form of instance stores or EBS volumes attached to each node.

Indices that are not actively written to (e.g., immutable data like logs), that are queried less frequently, or that don't need the hot storage's performance can be moved to warm storage.

Warm indices are read-only unless returned to hot storage.
Aside from that, they behave like any other index.

UltraWarm nodes use warm storage in the form of S3 and caching.

Cold storage is meant for data accessed only occasionally or no longer in active use.
Cold indices are normally detached from nodes and stored in S3, meaning one can't read from or write to them by default. To query them, selectively attach them to UltraWarm nodes first.

If using the hot-warm architecture, leverage the Index State Management plugin to automate the migration of indices to lower storage tiers once they meet specific conditions.
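
As a minimal sketch (the policy name, index pattern, and age threshold below are illustrative placeholders), an ISM policy using the AWS-specific warm_migration action described later could look like this:

PUT _plugins/_ism/policies/hot-to-warm
{
  "policy": {
    "description": "Move indices to UltraWarm storage after 7 days",
    "default_state": "hot",
    "states": [
      {
        "name": "hot",
        "actions": [],
        "transitions": [{
          "state_name": "warm",
          "conditions": { "min_index_age": "7d" }
        }]
      },
      {
        "name": "warm",
        "actions": [{ "warm_migration": {} }],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": ["logs-*"],
      "priority": 100
    }
  }
}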

UltraWarm storage

Refer to UltraWarm storage for Amazon OpenSearch Service.

Requirements:

  • OpenSearch/Elasticsearch >= v6.8.
  • Dedicated master nodes.
  • No t2 or t3 instance types as data nodes.
  • When using Multi-AZ with Standby, the number of warm nodes must be a multiple of the number of Availability Zones in use.
  • Others.

Considerations:

  • When calculating UltraWarm storage requirements, consider only the size of the primary shards.
    S3 removes the need for replicas and abstracts away any operating system or service considerations.
  • Dashboards and _cat/indices will still report UltraWarm index sizes as the total of all primary and replica shards (see the example after this list).
  • There are limits to the amount of storage each instance type can address and the maximum number of warm nodes supported by Domains.
  • Amazon recommends a maximum shard size of 50 GiB.
  • Upon enablement, UltraWarm might not be available to use for several hours even if the domain state is Active.
  • The minimum number of UltraWarm instances allowed by AWS is 2.
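
The difference between primary-only and total size can be checked with the standard _cat/indices columns (the index name is a placeholder):

GET _cat/indices/my-index?v&h=index,pri,rep,pri.store.size,store.size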

Before disabling UltraWarm, one must either delete all warm indices or migrate them back to hot storage.
After warm storage is empty, wait five minutes before attempting to disable UltraWarm.

Cold storage

Refer to Cold storage for Amazon OpenSearch Service.

Requirements:

Considerations:

  • One can't read from, nor write to, cold indices.

Operations

Migrate indices to UltraWarm storage

Indices' health must be green to perform migrations.

Migrations are executed one index at a time, sequentially.
There can be up to 200 migrations in the queue.
Any request that exceeds the limit will be rejected.

Index migrations to UltraWarm storage require a force merge operation, which purges documents that were marked for deletion.
By default, UltraWarm merges indices into one segment. One can set this value up to 1000.
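
A sketch of changing that value before starting a migration, assuming the index.ultrawarm.migration.force_merge.max_num_segments setting AWS documents for UltraWarm (the index name is a placeholder):

PUT my-index/_settings
{
  "index.ultrawarm.migration.force_merge.max_num_segments": 1
}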

Migrations might fail during snapshots, shard relocations, or force merges.
Failures during snapshots or shard relocation are typically due to node failures or S3 connectivity issues.
Lack of disk space is usually the underlying cause of force merge failures.

Start migration:

POST _ultrawarm/migration/my-index/_warm

Check the migration's status:

GET _ultrawarm/migration/my-index/_status
{
  "migration_status": {
    "index": "my-index",
    "state": "RUNNING_SHARD_RELOCATION",
    "migration_type": "HOT_TO_WARM",
    "shard_level_status": {
      "running": 0,
      "total": 5,
      "pending": 3,
      "failed": 0,
      "succeeded": 2
    }
  }
}

If a migration is in the queue but has not yet started, it can be removed from the queue:

POST _ultrawarm/migration/_cancel/my-index

Return warm indices to hot storage

Migrate them back to hot storage:

POST _ultrawarm/migration/my-index/_hot

There can be up to 10 queued migrations from warm to hot storage at a time.
Migration requests are processed one at a time in the order they were queued.

Indices return to hot storage with one replica.
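
Should more replicas be needed afterwards, the count can be raised again with the standard settings API (the index name and replica count are placeholders):

PUT my-index/_settings
{
  "index": {
    "number_of_replicas": 2
  }
}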

Migrate indices to Cold storage

The operations mirror those for UltraWarm storage; just change the endpoints accordingly:

Migrate a warm index to cold storage, check the migration's status, or cancel a queued migration:

POST _ultrawarm/migration/my-index/_cold
GET _ultrawarm/migration/my-index/_status
POST _ultrawarm/migration/_cancel/my-index

List or search the indices currently in cold storage:

GET _cold/indices/_search

Migrate cold indices back to warm storage (the indices to move are specified in the request body), check the migration's status, or cancel it:

POST _cold/migration/_warm
{ "indices": "my-index" }

GET _cold/migration/my-index/_status
POST _cold/migration/my-index/_cancel

Index state management plugin

Refer to OpenSearch's Index State Management plugin and Index State Management in Amazon OpenSearch Service.

Compared to self-managed OpenSearch and Elasticsearch, ISM for Amazon's managed OpenSearch service has several differences:

  • The managed OpenSearch service supports the three unique ISM operations warm_migration, cold_migration, and cold_delete.

    If one's domain has UltraWarm storage enabled, the warm_migration action transitions indices to warm storage.
    If one's domain has cold storage enabled, the cold_migration action transitions indices to cold storage, and the cold_delete action deletes them from cold storage.

    Should one of these actions not complete within its configured timeout period, the migration or deletion of the affected indices still continues.
    Setting an error_notification on one of these actions sends a notification when it fails to complete within that period, but the notification is informational only: the operation itself has no inherent timeout and keeps running until it eventually succeeds or fails.

  • [should the domain run OpenSearch or Elasticsearch 7.4 or later] The managed OpenSearch service supports the ISM open and close operations.

  • [should the domain run OpenSearch or Elasticsearch 7.7 or later] The managed OpenSearch service supports the ISM snapshot operation.

  • Cold indices API:

    • Require specifying the ?type=_cold parameter when you use the following ISM APIs:
      • Add policy
      • Remove policy
      • Update policy
      • Retry failed index
      • Explain index
    • Do not support wildcard operators, except when used at the end of the path.
      E.g., _plugins/_ism/add/logstash-* is supported, but _plugins/_ism/add/iad-*-prod is not.
    • Do not support multiple index names and patterns.
      E.g., _plugins/_ism/remove/app-logs is supported, but _plugins/_ism/remove/app-logs,sample-data is not.
  • The managed OpenSearch service allows changing only the following ISM settings:

    • plugins.index_state_management.enabled and plugins.index_state_management.history.enabled at cluster level.
    • plugins.index_state_management.rollover_alias at index level.

Snapshots

Refer to Snapshots and Creating index snapshots in Amazon OpenSearch Service.

AWS-managed OpenSearch Service snapshots come in the following forms:

  • Automated snapshots: only for cluster recovery, stored in a preconfigured S3 bucket at no additional cost.
    One can use them to restore the domain in the event of red cluster status or data loss (see the listing example after this list).
  • Manual snapshots: for cluster recovery or for moving data from one cluster to another.
    Manual snapshots must be initiated by the user.
    These snapshots are stored in one's own S3 bucket. Standard S3 charges apply.
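
As an example, automated snapshots can be listed from the preconfigured repository, assuming the cs-automated repository name (cs-automated-enc on domains with encryption at rest) from AWS's documentation; restoring typically requires first deleting or renaming any conflicting indices:

GET _snapshot/cs-automated/_all

POST _snapshot/cs-automated/{{ snapshot id }}/_restore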

All AWS-managed OpenSearch Service domains take automated snapshots, but their frequency and retention differ by version:

  • Domains running OpenSearch or Elasticsearch 5.3 and later take hourly automated snapshots and retain up to 336 of them for 14 days.
  • Domains running Elasticsearch 5.1 and earlier take daily automated snapshots during off-peak hours and retain up to 14 of them. No snapshot data is retained for more than 30 days.

Important

Should a cluster enter the red status, all automated snapshots will fail for as long as that status persists.

To be able to create snapshots manually:

  • An S3 bucket must exist to store snapshots.

    Important

    Manual snapshots do not support the S3 Glacier storage class.
    Do not apply any S3 Glacier lifecycle rule to this bucket.

  • An IAM role that delegates permissions to the OpenSearch Service must be defined.
    This role must be able to act on the S3 bucket above.

    Trust relationship (A.K.A. assume role policy)
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Principal": {
          "Service": "es.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }]
    }
    
    Policy
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": [
          "s3:ListBucket",
          "s3:GetObject",
          "s3:PutObject",
          "s3:DeleteObject"
        ],
        "Resource": [
          "arn:aws:s3:::{{ bucket name here }}",
          "arn:aws:s3:::{{ bucket name here }}/*"
        ]
      }]
    }
    
  • The IAM user or role whose credentials will be used to sign the requests must have permissions to:

    • Pass the role above to the OpenSearch Service.

      Policy
      {
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": "iam:PassRole",
          "Resource": "arn:aws:iam::{{ aws account id }}:role/{{ role name }}"
        }]
      }
      

      Should one use the domain's Dashboards dev tools, and should the domain use Cognito for authentication, those permissions need to be added to the IAM role that Cognito uses for the user pool.

      Should the user or role making the requests be missing such permissions, they might encounter this error when trying to register a repository in the next step:

      User: arn:aws:iam::123456789012:user/MyUserAccount is not authorized to perform: iam:PassRole on resource: arn:aws:iam::123456789012:role/TheSnapshotRole

    • Use the es:ESHttpPut action in the domain.

      Policy
      {
        "Version": "2012-10-17",
        "Statement": [{
          "Effect": "Allow",
          "Action": "es:ESHttpPut",
          "Resource": "arn:aws:es:{{ region}}:{{ aws account id }}:domain/{{ domain name }}/*"
        }]
      }
      

Snapshots can be taken only from indices in the hot or warm storage tiers.
A single snapshot request can include at most one warm index, and cannot mix indices from different storage tiers.
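
With the bucket, role, and permissions above in place, the repository can be registered and a manual snapshot taken. A minimal sketch (repository, bucket, role, and snapshot names are placeholders; the registration request must be signed with the credentials holding the iam:PassRole permission above):

PUT _snapshot/{{ repository name }}
{
  "type": "s3",
  "settings": {
    "bucket": "{{ bucket name here }}",
    "role_arn": "arn:aws:iam::{{ aws account id }}:role/{{ role name }}"
  }
}

PUT _snapshot/{{ repository name }}/{{ snapshot name }}

GET _snapshot/{{ repository name }}/{{ snapshot name }}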

Best practices

Refer to Operational best practices for Amazon OpenSearch Service and Best practices for configuring your Amazon OpenSearch Service domain.

Dedicated master nodes

Refer to Dedicated master nodes in Amazon OpenSearch Service.

They increase cluster stability by performing cluster management tasks.
They do not hold data nor respond to data upload requests.

Only one of the dedicated master nodes is active, while the others wait as backup in case the active dedicated master node fails.

All data upload requests are served by the data nodes, while all cluster management tasks are offloaded to the active dedicated master node. Cluster management tasks are:

  • Tracking all nodes in the cluster.
  • Maintaining routing information for nodes in the cluster.
  • Tracking the number of indices in the cluster.
  • Tracking the number of shards belonging to each index.
  • Updating the cluster state after state changes.
    E.g., creating an index, or adding or removing nodes in the cluster.
  • Replicating changes to the cluster state across all nodes in the cluster.
  • Monitoring the health of all cluster nodes by sending heartbeat signals.

Using Multi-AZ with Standby adds three dedicated master nodes to each OpenSearch Service domain it is enabled for.

Even when deploying in Single-AZ mode, three dedicated master nodes are recommended for stability.
In any case, never choose an even number of dedicated master nodes, to avoid split-brain problems.

If a cluster has an even number of master-eligible nodes, OpenSearch and Elasticsearch versions 7.x and later ignore one node so that the voting configuration is always an odd number.
As such, an even number of dedicated master nodes is essentially equivalent to that number minus one.
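
E.g., electing a master requires a majority quorum of floor(n / 2) + 1 master-eligible nodes: with 3 dedicated master nodes the quorum is 2 and the cluster tolerates the loss of one of them; with 4 nodes the quorum is 3, so it still tolerates the loss of only one while paying for an extra instance.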

If a cluster doesn't have the necessary quorum to elect a new master node, write and read requests to the cluster will both fail.
This behavior differs from the OpenSearch default.

The appropriate size for dedicated master nodes is highly correlated with the data instance size and with the number of instances, indices, and shards they must manage.

Cost-saving measures

  • Choose appropriate instance types and sizes.
    Leverage the ability to select them to tailor the service offering to one's needs.

    OR1 instances cannot (currently?) be selected as master nodes.
    They must also be selected at domain creation.

  • Consider using reserved instances for long-term savings.

  • Enable index-level compression to save storage space and reduce I/O costs (see the sketch after this list).

  • Use Index State Management policies to move old data to lower storage tiers.

  • Consider using S3 as data store for infrequently accessed or archived data.

  • Consider adjusting the frequency and retention period of snapshots.
    By default, domains running recent OpenSearch/Elasticsearch versions take hourly automated snapshots and retain them for 14 days.

  • If using gp2 EBS volumes, move to gp3.

  • Enable autoscaling (serverless only).

  • Optimize indices' sharding and replication.

  • Optimize queries.

  • Optimize data ingestion.

  • Optimize indices' mapping and settings.

  • Optimize the JVM heap size.

  • Summarize and compress historical data using index rollups.

  • Check out caches.

  • Reduce the number of requests using throttling and rate limiting.

  • Move to Single-AZ deployments.

  • Filter out and compress source data before sending it to OpenSearch to reduce the storage footprint and data transfer costs.

  • Share a single OpenSearch cluster across multiple accounts to reduce the overall number of instances and resources in use.
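
A sketch of the index-level compression mentioned above, using the standard best_compression index codec (the index name is a placeholder; the codec can only be set at index creation time or while the index is closed):

PUT {{ index name }}
{
  "settings": {
    "index.codec": "best_compression"
  }
}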

Further readings

Sources