OpenSearch

Search and analytics suite forked from Elasticsearch by Amazon.
Makes it easy to ingest, search, visualize, and analyze data.

Use cases: application search, log analytics, data observability, data ingestion, others.

  1. Concepts
    1. Update lifecycle
    2. Translog
    3. Refresh operations
    4. Flush operations
    5. Merge operations
    6. Node types
    7. Indexes
  2. Requirements
  3. Quickstart
  4. Tuning
  5. The split brain problem
  6. APIs
  7. Hot-warm architecture
  8. Further readings
    1. Sources

Concepts

Documents are the units that store information.
Information is text or structured data.
Documents are stored in the JSON format and returned when related information is searched for.

Indexes are collections of documents.
Their contents are queried when information is searched for.
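
As a minimal sketch of how a document enters an index (the movies index, document ID, and fields are made up for illustration), a JSON document can be added through the REST API; if the index does not exist yet, it is created with default settings:

    PUT /movies/_doc/1
    {
      "title": "The Matrix",
      "year": 1999,
      "genres": ["action", "sci-fi"]
    }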

OpenSearch is designed to be a distributed search engine running on one or more nodes.
Nodes are servers that store data and process search requests.

Clusters are collections of nodes allowing for different responsibilities to be taken on by different node types.
In each cluster a cluster manager node is elected. It orchestrates cluster-level operations such as creating an index.

Nodes in clusters communicate with each other: if a request is routed to a node, it sends requests to other nodes, gathers their responses, and returns the final response.

Indexes are split into shards, each of them storing a subset of all documents in an index.
Shards are evenly distributed across nodes in a cluster.
Each shard is effectively a full Lucene index. Since each instance of Lucene is a running process consuming CPU and memory, having more shards is not necessarily better.

Shards may be either primary (the original) or replicas (copies).
By default, one replica shard is created for each primary shard.

OpenSearch distributes replica shards to different nodes than their corresponding primary shards so that replica shards act as backups in the event of node failures.
Replicas also improve the speed at which the cluster processes search requests, which encourages using more than one replica per index for search-heavy workloads.
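
As an illustrative sketch (the index name and counts are examples only), the number of primary shards and replicas can be set when creating an index:

    PUT /movies
    {
      "settings": {
        "index": {
          "number_of_shards": 3,
          "number_of_replicas": 2
        }
      }
    }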

Indexes use a data structure called an inverted index. It maps words to the documents in which they occur.
When searching, OpenSearch matches the words in the query to the words in the documents. Each document is assigned a relevance score saying how well the document matched the query.

Individual words in a search query are called search terms, and each is scored according to the following rules:

  • Search terms that occur more frequently in a document will tend to be scored higher.
    This is the term frequency component of the score.
  • Search terms that occur in more documents will tend to be scored lower.
    This is the inverse document frequency component of the score.
  • Matches on longer documents should tend to be scored lower than matches on shorter documents.
    This corresponds to the length normalization component of the score.

OpenSearch uses the Okapi BM25 ranking algorithm to calculate document relevance scores and then returns the results sorted by relevance.
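
A minimal full-text query sketch (reusing the illustrative movies index from above); each hit in the response carries its BM25 relevance score in the _score field, and hits are returned sorted by that score:

    GET /movies/_search
    {
      "query": {
        "match": {
          "title": "matrix"
        }
      }
    }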

Update lifecycle

Update operations consist of the following steps:

  1. An update is received by a primary shard.
  2. The update is written to the shard's transaction log (translog).
  3. The translog is flushed to disk and followed by an fsync before the update is acknowledged to guarantee durability.
  4. The update is passed to the Lucene index writer, which adds it to an in-memory buffer.
  5. On a refresh operation, the Lucene index writer flushes the in-memory buffers to disk.
    Each buffer becomes a new Lucene segment.
  6. A new index reader is opened over the resulting segment files.
    The updates are now visible for search.
  7. On a flush operation, the shard fsyncs the Lucene segments.
    Because the segment files are a durable representation of the updates, the translog is no longer needed to provide durability. The updates can be purged from the translog.
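
As a rough way to observe part of this lifecycle (index name and document are illustrative), index a document and then list the shard's Lucene segments with the _cat API; new segments appear only after a refresh has run:

    PUT /movies/_doc/2
    {
      "title": "Inception",
      "year": 2010
    }

    GET /_cat/segments/movies?v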

Translog

Transaction log making updates durable.

Indexing or bulk calls respond when the documents have been written to the translog and the translog is flushed to disk.
Updates will not be visible to search requests until after a refresh operation.
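
When a caller needs a document to be searchable right away, indexing calls accept a refresh query parameter (shown here on the illustrative movies index); wait_for holds the response until the next refresh has made the update visible:

    PUT /movies/_doc/3?refresh=wait_for
    {
      "title": "Interstellar",
      "year": 2014
    }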

Refresh operations

Performed periodically to write the documents from the in-memory Lucene index to files.
These files are not guaranteed to be durable, because an fsync is not performed at this point.

A refresh makes documents available for search.
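
A refresh can also be triggered manually, and the automatic interval can be tuned per index through the dynamic index.refresh_interval setting (index name and interval are examples):

    POST /movies/_refresh

    PUT /movies/_settings
    {
      "index": {
        "refresh_interval": "30s"
      }
    }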

Flush operations

Persist the files to disk using fsync, ensuring durability.
Flushing ensures that the data stored only in the translog is recorded in the Lucene index.

Flushes are performed as needed to ensure that the translog does not grow too large.
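
Flushes are normally automatic, but one can be requested explicitly (illustrative index name):

    POST /movies/_flush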

Merge operations

Shards are Lucene indexes, which consist of segments (or segment files).
Segments store the indexed data and are immutable.

Smaller segments are merged into larger ones periodically to reduce the overall number of segments on each shard, free up disk space, and improve search performance.

Eventually, segments reach a maximum size and are no longer merged into larger segments.
Merge policies specify the maximum size and how often merges are performed.
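
Merging is automatic as well, but indexes that are no longer written to can be compacted explicitly with the force merge API (index name and target segment count are examples):

    POST /movies/_forcemerge?max_num_segments=1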

Node types

  • Cluster manager: manages the overall operation of a cluster and keeps track of the cluster state. This includes creating and deleting indexes, keeping track of the nodes that join and leave the cluster, checking the health of each node in the cluster (by running ping requests), and allocating shards to nodes.
    Best practices for production: use three dedicated cluster manager nodes in three different availability zones to ensure the cluster never loses quorum. Two of the nodes will be idle most of the time, except when one node goes down or needs maintenance.
  • Cluster manager eligible: elects one node among them as the cluster manager node through a voting process.
    Best practices for production: make sure to have dedicated cluster manager nodes by marking all other node types as not cluster manager eligible.
  • Data: stores and searches data. Performs all data-related operations (indexing, searching, aggregating) on local shards.
    Best practices for production: these are the worker nodes and need more disk space than any other node type. Keep them balanced between zones; storage- and RAM-heavy nodes are recommended.
  • Ingest: pre-processes data before storing it in the cluster. Runs an ingest pipeline that transforms data before adding it to an index.
    Best practices for production: use dedicated ingest nodes if you plan to ingest a lot of data and run complex ingest pipelines. Optionally offload indexing from the data nodes so that they are used exclusively for searching and aggregating.
  • Coordinating: delegates client requests to the shards on the data nodes, collects and aggregates the results into one final result, and sends this result back to the client.
    Best practices for production: prevent bottlenecks for search-heavy workloads by using a couple of dedicated coordinating-only nodes. Use CPUs with as many cores as possible.
  • Dynamic: delegates specific nodes for custom work (e.g. machine learning tasks), preventing the consumption of resources from data nodes and therefore not affecting functionality.
  • Search: provides access to searchable snapshots. Incorporates techniques like caching frequently used segments and removing the least used data segments in order to access the searchable snapshot index (stored in a remote long-term storage source, for example Amazon S3 or Google Cloud Storage).
    Best practices for production: use nodes with more compute (CPU and memory) than storage capacity (hard disk).

Each node is a cluster-manager-eligible, data, ingest, and coordinating node by default.
The number of nodes, the assignment of node types, and the choice of hardware for each node type depend on one's own use case. Take into account factors like the amount of time to hold on to data, the average size of documents, the typical workload (indexing, searches, aggregations), the expected price-performance ratio, risk tolerance, and so on.

After assessing all requirements, it is suggested to use benchmark testing tools like OpenSearch Benchmark.
Provision a small sample cluster and run tests with varying workloads and configurations, then compare and analyze the system and query metrics from these tests to improve upon the architecture.
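
A rough sketch of such a test run (it assumes OpenSearch Benchmark installed through pip and the publicly available geonames workload; --test-mode limits the run to a small subset of the data):

    pip install opensearch-benchmark
    opensearch-benchmark execute-test --workload geonames --test-mode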

Indexes

Data is indexed using the REST API.

There are two indexing APIs: the index API and the _bulk API.
The index API adds documents individually as they arrive, so it is intended for situations in which new data arrives incrementally (e.g., customer orders from a small business).
The _bulk API takes in one file that lumps many requests together, offering superior performance for situations in which the flow of data is less frequent and can be aggregated into a generated file.
Enormous documents should still be indexed individually.
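
A minimal _bulk request sketch (index, IDs, and documents are illustrative); the payload is newline-delimited JSON in which each action line is followed by its document:

    POST /_bulk
    { "index": { "_index": "movies", "_id": "4" } }
    { "title": "Alien", "year": 1979 }
    { "index": { "_index": "movies", "_id": "5" } }
    { "title": "Blade Runner", "year": 1982 }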

When indexing documents, the document's _id must be 512 bytes or less in size.

Static index settings can only be updated on closed indexes.
Dynamic index settings can be updated at any time through the APIs.

Requirements

Ports used by each component:

  • 443: OpenSearch Dashboards in AWS OpenSearch Service with encryption in transit (TLS).
  • 5601: OpenSearch Dashboards.
  • 9200: OpenSearch REST API.
  • 9300: node communication and transport (internal), cross cluster search.
  • 9600: Performance Analyzer.

For Linux hosts:

  • vm.max_map_count must be set to at least 262144.

Quickstart

Use docker compose.

  1. Disable memory paging and swapping on Linux hosts to improve performance and increase the number of maps available to the service:

    sudo swapoff -a
    sudo sysctl -w vm.max_map_count=262144
    
  2. Get the sample compose file:

    curl -O 'https://raw.githubusercontent.com/opensearch-project/documentation-website/2.14/assets/examples/docker-compose.yml'
    
  3. Adjust the compose file and run it:

    docker compose up -d
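
  4. Optionally verify that the cluster answers requests (a hedged example: it assumes the sample compose file's security plugin is enabled and the admin password was set through OPENSEARCH_INITIAL_ADMIN_PASSWORD; replace the placeholder accordingly):

    curl -k -u 'admin:<admin password>' 'https://localhost:9200'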
    

Tuning

  • Disable swapping.
    If kept enabled, it can dramatically decrease performance and stability.
  • Avoid using network file systems for node storage in a production workflow.
    Using them can cause performance issues due to network conditions (e.g. latency, limited throughput) or slow read/write speeds.
  • Use solid-state drives (SSDs) on the hosts for node storage where possible.
  • Set the size of the Java heap; see the sketch after this list.
    It is recommended to use half of the system's RAM.
  • Set up a hot-warm architecture.
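
A minimal sketch for the Java heap item above, assuming a package or tarball installation where JVM flags live in config/jvm.options (values are examples for a host with 32 GiB of RAM):

    # config/jvm.options excerpt: fixed 16 GiB heap
    -Xms16g
    -Xmx16g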

The split brain problem

TODO

APIs

FIXME: expand

  • Close indexes.
    Disables read and write operations.

    POST /prometheus-logs-20231205/_close
    
  • (Re)Open closed indexes.
    Enables read and write operations.

    POST /prometheus-logs-20231205/_open
    
  • Update indexes' settings.
    Static settings can only be updated on closed indexes.

    PUT /prometheus-logs-20231205/_settings
    {
      "index": {
        "codec": "zstd_no_dict",
        "codec.compression_level": 3,
        "refresh_interval": "2s"
      }
    }
    

Hot-warm architecture

Refer to Set up a hot-warm architecture.

Further readings

Sources