How to Create an Autoscaling Elasticsearch Cluster on Google Kubernetes Engine
Elasticsearch is a powerful, open-source search and analytics engine designed to handle large volumes of data quickly and in near real time. It's built on Apache Lucene and provides a distributed, multitenant-capable full-text search engine. Popular use cases include website search, collecting and analyzing log data, and serving as a vector database. Many businesses also leverage Elasticsearch for security information and event management (SIEM) to detect threats and monitor their systems. Ultimately, its speed and scalability make it a versatile tool for a wide range of data-driven applications.
As with other distributed systems, sizing Elasticsearch correctly is one of the most important factors for its performance. If you run a fully on-premise cluster, changing the cluster size (adding or removing nodes) can be operationally exhausting. And if you need to resize the cluster at regular intervals (for example, scaling up during campaign periods and back down during quiet periods), looking for an alternative approach becomes inevitable.
To address these operational challenges, deploying an Elasticsearch cluster on Kubernetes offers a compelling and agile alternative. Kubernetes, as a container orchestration platform, inherently simplifies the process of scaling distributed systems like Elasticsearch. Your cluster can dynamically expand to handle the surge in traffic and seamlessly shrink back down during quieter times, all without manual intervention, thus optimizing resource utilization and significantly reducing the operational overhead.
For these operations, the Elastic Cloud on Kubernetes (ECK) operator can be your savior. It's the official and easiest way to deploy an autoscaling cluster on Kubernetes → https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s However, there is an important caveat: the ECK autoscaling feature requires an Enterprise license, which means you need to purchase a license to use it.
But what if you want to proceed with a free and open-source solution without a license? In this case, Zalando's Elasticsearch operator can come to your aid. → https://github.com/zalando-incubator/es-operator
Zalando SE is a publicly traded international online retailer based in Berlin which is active across Europe and specializes in shoes, fashion and beauty products.
We can specify the properties of this operator as follows:
- The operator works by managing custom resources called ElasticsearchDataSets (EDS). They are basically a thin wrapper around StatefulSets.
- One EDS represents a common group of Elasticsearch data nodes.
- It can scale in two dimensions, shards per node and number of replicas for the indices on that dataset.
- Has been tested with Elasticsearch 7.x and 8.x.
- The operator does not manage Elasticsearch master nodes; you need to create and manage them yourself.
- Do not operate on the StatefulSet manually; the operator owns this resource on your behalf.
In case of emergency, manual scaling is possible by disabling the auto-scaling feature.
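For example, manual scaling could be performed by disabling autoscaling on the EDS and setting replicas directly (a sketch based on the operator's EDS spec; the resource name is illustrative):

```yaml
apiVersion: zalando.org/v1
kind: ElasticsearchDataSet
metadata:
  name: es-data-simple    # illustrative name
  namespace: default
spec:
  replicas: 5             # desired data-node count, applied manually
  scaling:
    enabled: false        # autoscaling off; the operator leaves replicas alone
```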
As you can see, the Zalando operator only manages the data nodes. Therefore, we can create the remaining components of our cluster, the master nodes and Kibana instances, with Elastic's ECK operator and perform a hybrid deployment.
We created an Elasticsearch cluster on Google Kubernetes Engine (GKE) with the following YAML files and commands:
1. Install ECK Operator’s Custom Resource Definitions
kubectl create -f https://download.elastic.co/downloads/eck/3.0.0/crds.yaml
2. Install the ECK Operator with RBAC Rules
kubectl apply -f https://download.elastic.co/downloads/eck/3.0.0/operator.yaml
You can check the operator logs and status with:
kubectl -n elastic-system logs -f statefulset.apps/elastic-operator
kubectl get -n elastic-system pods
3. Create Master Nodes (master.yaml)
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: test
  namespace: default
spec:
  version: 8.18.2
  nodeSets:
  - name: master
    count: 3
    config:
      node.roles: ["master"]
      node.store.allow_mmap: false
      cluster.routing.allocation.awareness.attributes: []
      cluster.routing.allocation.node_concurrent_incoming_recoveries: 10
      cluster.routing.allocation.node_concurrent_recoveries: 10
      xpack.security.authc.anonymous.username: anonymous_user
      xpack.security.authc.anonymous.roles: superuser
      xpack.security.authc.anonymous.authz_exception: true
    podTemplate:
      spec:
        containers:
        - name: elasticsearch
          env:
          - name: ES_JAVA_OPTS
            value: "-Xms4g -Xmx4g"
          resources:
            requests:
              memory: "8Gi"
              cpu: "4"
            limits:
              memory: "8Gi"
              cpu: "4"
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: "30Gi"
        storageClassName: standard-rwo
  http:
    tls:
      selfSignedCertificate:
        disabled: true
Master nodes are configured with 4 CPUs and 8 GiB of memory. You can change these, along with ES_JAVA_OPTS, if needed.
Apply the master.yaml file:
kubectl apply -f master.yaml
4. Clone the Zalando operator repository and deploy the operator:
git clone https://github.com/zalando-incubator/es-operator.git
cd es-operator
kubectl apply -f docs/cluster-roles.yaml
kubectl apply -f docs/zalando.org_elasticsearchdatasets.yaml
kubectl apply -f docs/zalando.org_elasticsearchmetricsets.yaml
kubectl apply -f docs/es-operator.yaml
kubectl -n es-operator-demo get pods
5. Create a file named data.yaml with the content below:
apiVersion: v1
kind: ConfigMap
metadata:
  name: es-config
  namespace: default
data:
  elasticsearch.yml: |
    cluster.name: "test"
    network.host: "0.0.0.0"
    bootstrap.memory_lock: false
    xpack.security.authc.anonymous.username: anonymous_user
    xpack.security.authc.anonymous.roles: superuser
    xpack.security.authc.anonymous.authz_exception: true
    discovery.seed_hosts: [test-es-transport.default.svc:9300]
    cluster.initial_master_nodes: [test-es-transport.default.svc:9300]
    node.roles: [data]
    readiness.port: 9400
    cluster.routing.allocation.node_concurrent_incoming_recoveries: 10
    cluster.routing.allocation.node_concurrent_recoveries: 10
    xpack.security.enabled: true
    xpack.security.http.ssl.enabled: false
    xpack.security.transport.ssl.enabled: true
    xpack.security.http.ssl.key: /usr/share/elasticsearch/config/http-secrets/tls.key
    xpack.security.http.ssl.certificate: /usr/share/elasticsearch/config/http-secrets/tls.crt
    xpack.security.http.ssl.certificate_authorities: /usr/share/elasticsearch/config/http-secrets/ca.crt
    xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/transport-secrets/tls.key
    xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/transport-secrets/tls.crt
    xpack.security.transport.ssl.certificate_authorities: /usr/share/elasticsearch/config/transport-ca-secrets/ca.crt
    xpack.security.transport.ssl.verification_mode: certificate
---
apiVersion: zalando.org/v1
kind: ElasticsearchDataSet
metadata:
  labels:
    application: elasticsearch
    role: data
    group: simple
  name: es-data-simple
  namespace: default
spec:
  volumeClaimTemplates:
  - metadata:
      name: es-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
  replicas: 3
  scaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 4
    minIndexReplicas: 2
    maxIndexReplicas: 3
    minShardsPerNode: 3
    maxShardsPerNode: 6
    scaleUpCPUBoundary: 50
    scaleUpThresholdDurationSeconds: 900
    scaleUpCooldownSeconds: 3600
    scaleDownCPUBoundary: 25
    scaleDownThresholdDurationSeconds: 1800
    scaleDownCooldownSeconds: 3600
    diskUsagePercentScaledownWatermark: 80
  template:
    metadata:
      labels:
        application: elasticsearch
        role: data
        group: simple
    spec:
      securityContext:
        fsGroup: 1000
      containers:
      - name: elasticsearch
        env:
        - name: "node.name"
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: "node.attr.group"
          value: "simple"
        - name: "ES_JAVA_OPTS"
          value: "-Xmx4000m -Xms4000m"
        image: "docker.elastic.co/elasticsearch/elasticsearch:8.18.2"
        ports:
        - containerPort: 9300
          name: transport
        readinessProbe:
          tcpSocket:
            port: 9400
          initialDelaySeconds: 15
          periodSeconds: 10
        resources:
          limits:
            cpu: 4
            memory: 8000Mi
          requests:
            cpu: 4
            memory: 8000Mi
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: es-data
        - name: elasticsearch-config
          mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          subPath: elasticsearch.yml
        - name: http-secret
          mountPath: /usr/share/elasticsearch/config/http-secrets
        - name: transport-secret
          mountPath: /usr/share/elasticsearch/config/transport-secrets
        - name: transport-ca-secret
          mountPath: /usr/share/elasticsearch/config/transport-ca-secrets
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        image: busybox:1.30
        name: init-sysctl
        resources:
          limits:
            cpu: 50m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 50Mi
        securityContext:
          runAsUser: 0
          privileged: true
      volumes:
      - name: elasticsearch-config
        configMap:
          name: es-config
          items:
          - key: elasticsearch.yml
            path: elasticsearch.yml
      - name: http-secret
        secret:
          secretName: test-es-http-certs-internal
      - name: transport-secret
        secret:
          secretName: test-es-transport-ca-internal
      - name: transport-ca-secret
        secret:
          secretName: test-es-master-es-transport-certs
If you want to customize the deployment, such as changing the namespace or the cluster name, you should also reconfigure the settings that depend on them, such as the secret names, discovery.seed_hosts, and cluster.initial_master_nodes.
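For instance, if the ECK cluster were named prod in a namespace called search (hypothetical values), the dependent settings would change roughly as follows, since ECK derives its service and secret names from the cluster name:

```yaml
# Hypothetical cluster name "prod" in namespace "search":
discovery.seed_hosts: [prod-es-transport.search.svc:9300]
cluster.initial_master_nodes: [prod-es-transport.search.svc:9300]
# ...and the mounted secret names change accordingly, e.g.:
# secretName: prod-es-http-certs-internal
```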
Apply the data.yaml file:
kubectl apply -f data.yaml
6. Create a file named kibana.yaml:
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: test
spec:
  version: 8.16.1
  count: 1
  elasticsearchRef:
    name: test
Apply the kibana.yaml file:
kubectl apply -f kibana.yaml
A Few Points to Consider
- The Zalando operator can't cope with Elasticsearch authorization (refer to this issue). That's why the following configs are added to the master and data YAML files:
xpack.security.authc.anonymous.username: anonymous_user
xpack.security.authc.anonymous.roles: superuser
xpack.security.authc.anonymous.authz_exception: true
- The operator does not receive releases frequently. The latest version, 0.1.5, was released in January 2025, whereas the previous version, 0.1.4, was released in March 2022. See the release notes.
- The operator's autoscaling is designed around the number of shards in the cluster. If the shard count changes dramatically and frequently, this operator may not be optimal.
- If the cluster is not in a healthy (green) state, the operator halts operations.
- To ensure high availability, place each Elasticsearch master and data node on dedicated GKE nodes distributed across multiple zones. Kubernetes taints and tolerations can enforce stricter pod placement.
- The operational requirements (disabling HTTP TLS and permitting anonymous Elasticsearch access) introduce security risks such as data interception and unauthorized access. These can be mitigated through:
- Private GKE Cluster: Nodes do not have public IP addresses, isolating the Kubernetes environment from external traffic.
- Granular Network Policies: Implement strict Kubernetes Network Policies to enforce a "zero-trust" model.
- Least-Privilege Role: Configure the anonymous user with minimal necessary permissions.
- To integrate the Zalando-managed data nodes with the ECK-managed master nodes, we implemented the following configurations:
- We enabled secure, encrypted communication by extracting the http and transport security secrets from the ECK masters and mounting them onto the Zalando data nodes. Corresponding xpack.security options were then added to the data nodes' ConfigMap to activate TLS and authentication.
- To ensure proper cluster discovery and membership, we configured the discovery.seed_hosts and cluster.initial_master_nodes parameters in the data nodes' ConfigMap to resolve to the headless service of the ECK master nodes, and explicitly set a matching cluster.name.
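As an illustration of the network-policy mitigation mentioned above, a minimal Kubernetes NetworkPolicy might look like the sketch below. The allowed client selector (app: search-client) is an assumption for illustration, not part of the original setup; only the elasticsearch labels match the data pods defined earlier:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: es-data-allow-clients   # illustrative name
  namespace: default
spec:
  podSelector:
    matchLabels:
      application: elasticsearch   # matches the data pods' labels above
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: search-client       # hypothetical client workload label
    ports:
    - protocol: TCP
      port: 9200                   # Elasticsearch HTTP port
```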
Example Scenario: How Autoscaling Works in Zalando Operator
Assume that your cluster has 4 indices with 6 shards each, and each index has 2 replicas → 4 indices x 6 shards x 3 copies (1 primary shard + 2 replica shards) = 72 shards in total. The operator settings are:
minReplicas = 2
maxReplicas = 3
minShardsPerNode = 2
maxShardsPerNode = 4
scaleUpCPUBoundary = 40
scaleUpThresholdDurationSeconds = 1200 (20 mins)
scaleUpCooldownSeconds = 3600 (1 hour)
scaleDownCPUBoundary = 25
scaleDownThresholdDurationSeconds = 1800 (30 mins)
scaleDownCooldownSeconds = 3600 (1 hour)
For the initial/minimal deployment → 72 shards / 4 shards per node (maxShardsPerNode) = 18 nodes. If mean CPU utilization exceeds 40% (scaleUpCPUBoundary) for more than 20 minutes (scaleUpThresholdDurationSeconds), a scale-up starts. The operator then waits 1 hour (scaleUpCooldownSeconds) between scale-up operations and 1 hour (scaleDownCooldownSeconds) between scale-down operations.
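The shard arithmetic above can be sketched in a few lines (values taken from the example scenario):

```python
import math

# Values from the example scenario in this section.
indices = 4
primary_shards_per_index = 6
replicas_per_index = 2
max_shards_per_node = 4  # maxShardsPerNode

# Every primary shard exists in (1 primary + replicas) copies.
total_shards = indices * primary_shards_per_index * (1 + replicas_per_index)

# Minimal node count so that no node exceeds maxShardsPerNode.
min_nodes = math.ceil(total_shards / max_shards_per_node)

print(total_shards)  # 72
print(min_nodes)     # 18
```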
Some Helpful Links
- What is an Elasticsearch Index?
- Elasticsearch Node Roles
- Index Lifecycle Management (ILM)
- Kubecon Zalando Elasticsearch Operator Presentation
- Set up Elastic Stack on GKE
Authors
- Mert Yiğit Aladağ - Associate Platform & Cloud Engineering Manager, Oredata
- Atakan Tatlı - Senior Cloud Engineer, Oredata