How to Create an Autoscaling Elasticsearch Cluster on Google Kubernetes Engine

Elasticsearch is a powerful, open-source search and analytics engine designed to handle large volumes of data quickly and in near real time. It's built on Apache Lucene and provides a distributed, multitenant-capable full-text search engine. Popular use cases include website search, collecting and analyzing log data, and use as a vector database. Many businesses also leverage Elasticsearch for security information and event management (SIEM) to detect threats and monitor their systems. Ultimately, its speed and scalability make it a versatile tool for a wide range of data-driven applications.

Just like other distributed systems, sizing an Elasticsearch cluster correctly is one of the key factors in its performance. If you run a fully on-premise cluster, resizing it (adding or removing nodes) can be operationally tiring. If you need to resize the cluster at certain frequencies (for example, scaling up during campaign periods and down during quiet periods), looking for an alternative approach becomes inevitable.

To address these operational challenges, deploying an Elasticsearch cluster on Kubernetes offers a compelling and agile alternative. Kubernetes, as a container orchestration platform, inherently simplifies the process of scaling distributed systems like Elasticsearch. Your cluster can dynamically expand to handle the surge in traffic and seamlessly shrink back down during quieter times, all without manual intervention, thus optimizing resource utilization and significantly reducing the operational overhead.

For these operations, the Elastic Cloud on Kubernetes (ECK) operator can be your savior. It's the official and easiest way to deploy an autoscaling cluster on Kubernetes → https://www.elastic.co/docs/deploy-manage/deploy/cloud-on-k8s However, there is an important point to mention: the ECK autoscaling feature comes with the Enterprise license. This means that you need to purchase a license to use this feature.

🔔 Note Elasticsearch autoscaling requires a valid Enterprise license or Enterprise trial license. Check the license documentation for more details about managing licenses.

But what if you want to proceed with a free and open-source solution without a license? In this case, Zalando's Elasticsearch operator can come to your aid. → https://github.com/zalando-incubator/es-operator

Zalando SE is a publicly traded international online retailer based in Berlin that is active across Europe and specializes in shoes, fashion, and beauty products.

We can summarize the operator's properties as follows:

  • The operator works by managing custom resources called ElasticsearchDataSets (EDS), which are essentially a thin wrapper around StatefulSets.
  • One EDS represents a common group of Elasticsearch data nodes.
  • It can scale in two dimensions: the number of shards per node and the number of replicas for the indices on that dataset.
  • It has been tested with Elasticsearch 7.x and 8.x.
  • The operator does not manage Elasticsearch master nodes; you need to create them yourself.
  • Do not operate on the StatefulSet manually. The operator is supposed to own this resource on your behalf.

In case of emergency, manual scaling is possible by disabling the auto-scaling feature.

As you can see, the Zalando operator is only responsible for managing the data nodes. Therefore, we can create the other components of our cluster, the master nodes and Kibana instances, with Elastic's ECK operator and perform a hybrid deployment.

We created an Elasticsearch cluster on Google Kubernetes Engine (GKE) with the YAML files and commands below:

1. Install ECK Operator’s Custom Resource Definitions

kubectl create -f https://download.elastic.co/downloads/eck/3.0.0/crds.yaml

2. Install the ECK Operator with RBAC Rules

kubectl apply -f https://download.elastic.co/downloads/eck/3.0.0/operator.yaml

You can check the operator logs and status with:

kubectl -n elastic-system logs -f statefulset.apps/elastic-operator
kubectl get -n elastic-system pods

3. Create Master Nodes (master.yaml)

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: test
  namespace: default
spec:
  version: 8.18.2
  nodeSets:
    - name: master
      count: 3
      config:
        node.roles: ["master"]
        node.store.allow_mmap: false
        cluster.routing.allocation.awareness.attributes: []
        cluster.routing.allocation.node_concurrent_incoming_recoveries: 10
        cluster.routing.allocation.node_concurrent_recoveries: 10
        xpack.security.authc.anonymous.username: anonymous_user
        xpack.security.authc.anonymous.roles: superuser
        xpack.security.authc.anonymous.authz_exception: true
      podTemplate:
        spec:
          containers:
            - name: elasticsearch
              env:
                - name: ES_JAVA_OPTS
                  value: "-Xms4g -Xmx4g"
              resources:
                requests:
                  memory: "8Gi"
                  cpu: "4"
                limits:
                  memory: "8Gi"
                  cpu: "4"
      volumeClaimTemplates:
        - metadata:
            name: elasticsearch-data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: "30Gi"
            storageClassName: standard-rwo
  http:
    tls:
      selfSignedCertificate:
        disabled: true

Master nodes are configured with 4 CPUs and 8 GiB of memory, with the JVM heap (ES_JAVA_OPTS) set to 4 GB, roughly half of the container memory, in line with Elasticsearch's heap-sizing guidance. You can change these values if needed.

Apply the master.yaml file:

kubectl apply -f master.yaml

4. Clone the Zalando operator repository and deploy the operator:

git clone https://github.com/zalando-incubator/es-operator.git
cd es-operator
kubectl apply -f docs/cluster-roles.yaml
kubectl apply -f docs/zalando.org_elasticsearchdatasets.yaml
kubectl apply -f docs/zalando.org_elasticsearchmetricsets.yaml
kubectl apply -f docs/es-operator.yaml
kubectl -n es-operator-demo get pods

5. Create a file named data.yaml with the content below:

apiVersion: v1
kind: ConfigMap
metadata:
  name: es-config
  namespace: default
data:
  elasticsearch.yml: |
    cluster.name: "test"
    network.host: "0.0.0.0"
    bootstrap.memory_lock: false
    xpack.security.authc.anonymous.username: anonymous_user
    xpack.security.authc.anonymous.roles: superuser
    xpack.security.authc.anonymous.authz_exception: true
    discovery.seed_hosts: [test-es-transport.default.svc:9300]
    cluster.initial_master_nodes: [test-es-transport.default.svc:9300]
    node.roles: [data]
    readiness.port: 9400
    cluster.routing.allocation.node_concurrent_incoming_recoveries: 10
    cluster.routing.allocation.node_concurrent_recoveries: 10
    xpack.security.enabled: true
    xpack.security.http.ssl.enabled: false
    xpack.security.transport.ssl.enabled: true
    xpack.security.http.ssl.key: /usr/share/elasticsearch/config/http-secrets/tls.key
    xpack.security.http.ssl.certificate: /usr/share/elasticsearch/config/http-secrets/tls.crt
    xpack.security.http.ssl.certificate_authorities: /usr/share/elasticsearch/config/http-secrets/ca.crt
    xpack.security.transport.ssl.key: /usr/share/elasticsearch/config/transport-secrets/tls.key
    xpack.security.transport.ssl.certificate: /usr/share/elasticsearch/config/transport-secrets/tls.crt
    xpack.security.transport.ssl.certificate_authorities: /usr/share/elasticsearch/config/transport-ca-secrets/ca.crt
    xpack.security.transport.ssl.verification_mode: certificate
---
apiVersion: zalando.org/v1
kind: ElasticsearchDataSet
metadata:
  labels:
    application: elasticsearch
    role: data
    group: simple
  name: es-data-simple
  namespace: default
spec:
  volumeClaimTemplates:
  - metadata:
      name: es-data
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 10Gi
  replicas: 3
  scaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 4
    minIndexReplicas: 2
    maxIndexReplicas: 3
    minShardsPerNode: 3
    maxShardsPerNode: 6
    scaleUpCPUBoundary: 50
    scaleUpThresholdDurationSeconds: 900
    scaleUpCooldownSeconds: 3600
    scaleDownCPUBoundary: 25
    scaleDownThresholdDurationSeconds: 1800
    scaleDownCooldownSeconds: 3600
    diskUsagePercentScaledownWatermark: 80
  template:
    metadata:
      labels:
        application: elasticsearch
        role: data
        group: simple
    spec:
      securityContext:
        fsGroup: 1000
      containers:
      - name: elasticsearch
        env:
        - name: "node.name"
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: "node.attr.group"
          value: "simple"
        - name: "ES_JAVA_OPTS"
          value: "-Xmx4000m -Xms4000m"
        image: "docker.elastic.co/elasticsearch/elasticsearch:8.18.2"
        ports:
        - containerPort: 9300
          name: transport
        readinessProbe:
          tcpSocket:
            port: 9400
          initialDelaySeconds: 15
          periodSeconds: 10
        resources:
          limits:
            cpu: 4
            memory: 8000Mi
          requests:
            cpu: 4
            memory: 8000Mi
        volumeMounts:
        - mountPath: /usr/share/elasticsearch/data
          name: es-data
        - name: elasticsearch-config
          mountPath: /usr/share/elasticsearch/config/elasticsearch.yml
          subPath: elasticsearch.yml
        - name: http-secret
          mountPath: /usr/share/elasticsearch/config/http-secrets
        - name: transport-secret
          mountPath: /usr/share/elasticsearch/config/transport-secrets
        - name: transport-ca-secret
          mountPath: /usr/share/elasticsearch/config/transport-ca-secrets
      initContainers:
      - command:
        - sysctl
        - -w
        - vm.max_map_count=262144
        image: busybox:1.30
        name: init-sysctl
        resources:
          limits:
            cpu: 50m
            memory: 50Mi
          requests:
            cpu: 50m
            memory: 50Mi
        securityContext:
          runAsUser: 0
          privileged: true
      volumes:
      - name: elasticsearch-config
        configMap:
          name: es-config
          items:
          - key: elasticsearch.yml
            path: elasticsearch.yml
      - name: http-secret
        secret:
          secretName: test-es-http-certs-internal
      - name: transport-secret
        secret:
          secretName: test-es-transport-ca-internal
      - name: transport-ca-secret
        secret:
          secretName: test-es-master-es-transport-certs

If you want to customize the deployment, such as changing the namespace or the cluster name, you should reconfigure some of the settings accordingly, such as the secret names, discovery.seed_hosts, and cluster.initial_master_nodes.
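For instance, assuming a hypothetical ECK cluster named prod in a namespace named search (illustrative names, not part of the setup above), the discovery settings in the ConfigMap would change like this, and the mounted secret names would follow the same <cluster-name>-es-... naming pattern:

```yaml
# Hypothetical example: ECK cluster "prod" in namespace "search"
cluster.name: "prod"
discovery.seed_hosts: [prod-es-transport.search.svc:9300]
cluster.initial_master_nodes: [prod-es-transport.search.svc:9300]
# The secret names referenced in the volumes section would likewise become
# prod-es-http-certs-internal, prod-es-transport-ca-internal, and
# prod-es-master-es-transport-certs.
```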

Apply the data.yaml file:

kubectl apply -f data.yaml

6. Create a file named kibana.yaml:

apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: test
spec:
  version: 8.18.2
  count: 1
  elasticsearchRef:
    name: test

Apply the kibana.yaml file:

kubectl apply -f kibana.yaml

A Few Points to Consider

  • The Zalando operator can't cope with authorization. Refer to this issue. That's why the following configs are added to both the master and data YAML files:
    • xpack.security.authc.anonymous.username: anonymous_user
    • xpack.security.authc.anonymous.roles: superuser
    • xpack.security.authc.anonymous.authz_exception: true

  • The operator does not receive releases frequently. The latest version, 0.1.5, was released in January 2025, whereas the previous version, 0.1.4, was released in March 2022. See the release notes.
  • The operator's autoscaling is designed around the number of shards in the cluster. If the shard count frequently changes dramatically, using this operator may not be optimal.
  • If the cluster is not in a healthy (green) state, the operator halts operations.
  • To ensure high availability, place each Elasticsearch master and data node on dedicated GKE nodes distributed across multiple zones. Kubernetes taints and tolerations can enforce stricter pod placement.
  • The operational requirements—disabling HTTP TLS and permitting anonymous Elasticsearch access—introduce security risks such as data interception and unauthorized access. These can be mitigated through:
    • Private GKE Cluster: Nodes do not have public IP addresses, isolating the Kubernetes environment from external traffic.
    • Granular Network Policies: Implement strict Kubernetes Network Policies to enforce a "zero-trust" model.
    • Least-Privilege Role: Configure the anonymous user with minimal necessary permissions.
  • To integrate the Zalando-managed data nodes with the ECK-managed master nodes, we implemented the following configurations:
    • We enabled secure, encrypted communication by extracting the http and transport security secrets from the ECK masters and mounting them onto the Zalando data nodes. Corresponding xpack.security options were then added to the data nodes' ConfigMap to activate TLS and authentication.
    • To ensure proper cluster discovery and membership, we configured the discovery.seed_hosts and cluster.initial_master_nodes parameters in the data nodes' ConfigMap to resolve to the headless service of the ECK master nodes, and explicitly set a matching cluster.name.

Example Scenario: How Autoscaling Works in Zalando Operator

Assume that your cluster has 4 indices with 6 primary shards each, and each index has 2 replicas → 4 indices x 6 shards x 3 copies (1 primary shard + 2 replica shards) = 72 shards in total. The operator settings are:

minReplicas = 2
maxReplicas = 3
minShardsPerNode = 2
maxShardsPerNode = 4
scaleUpCPUBoundary = 40
scaleUpThresholdDurationSeconds = 1200 (20 mins)
scaleUpCooldownSeconds = 3600 (1 hour)
scaleDownCPUBoundary = 25
scaleDownThresholdDurationSeconds = 1800 (30 mins)
scaleDownCooldownSeconds = 3600 (1 hour)

For the initial/minimal deployment → 72 shards / 4 per node (maxShardsPerNode) = 18 nodes. If mean CPU utilization exceeds 40% (scaleUpCPUBoundary) for more than 20 minutes (scaleUpThresholdDurationSeconds), the scale-up process starts:

  • First, scale up by decreasing the shards-per-node ratio to 3 → 72 shards / 3 per node = 24 nodes
  • Scale up by decreasing the shards-per-node ratio to 2 (minShardsPerNode) → 72 shards / 2 per node = 36 nodes
  • Scale up by increasing replicas: 4 indices x 6 shards x 4 copies (1 primary shard + 3 replicas (maxReplicas)) = 96 shards / 2 per node = 48 nodes
  • No further scale-up beyond this point (a safety net to avoid cost explosion)
  • The operator waits 1 hour (scaleUpCooldownSeconds) between scale-up operations

  • The scale-down operation runs in reverse order. If the expected average CPU utilization would stay below 25% (scaleDownCPUBoundary), first decrease the replica count to 2 → 72 shards (4 indices x 6 shards x 3 copies (1 primary shard + 2 replica shards)) / 2 per node = 36 nodes
  • Scale down by increasing the shards-per-node ratio to 3 → 72 shards / 3 per node = 24 nodes
  • Scale down by increasing the shards-per-node ratio to 4 (maxShardsPerNode) → 72 shards / 4 per node = 18 nodes
  • The operator waits 1 hour (scaleDownCooldownSeconds) between scale-down operations
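The arithmetic above can be sketched in a few lines of Python. This is a simplified model of the node-count calculations under the example's assumptions, not the operator's actual implementation:

```python
import math

INDICES = 4          # number of indices in the example
PRIMARY_SHARDS = 6   # primary shards per index

def total_shards(index_replicas: int) -> int:
    """Total shard copies: one primary plus the replicas, for every index."""
    return INDICES * PRIMARY_SHARDS * (1 + index_replicas)

def nodes_needed(index_replicas: int, shards_per_node: int) -> int:
    """Data nodes required to hold all shard copies at the given density."""
    return math.ceil(total_shards(index_replicas) / shards_per_node)

# Scale-up sequence from the example:
print(nodes_needed(2, 4))  # initial deployment: 72 / 4 = 18 nodes
print(nodes_needed(2, 3))  # step 1: 72 / 3 = 24 nodes
print(nodes_needed(2, 2))  # step 2: 72 / 2 = 36 nodes
print(nodes_needed(3, 2))  # step 3: 96 / 2 = 48 nodes (maxReplicas reached)
```

Scaling down simply walks the same sequence in reverse, raising the shards-per-node ratio and lowering the replica count.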

Authors

  • Mert Yiğit Aladağ - Associate Platform & Cloud Engineering Manager, Oredata
  • Atakan Tatlı - Senior Cloud Engineer, Oredata