Elasticsearch is a widely used open-source search engine that powers many organizational search solutions. However, manually managing an Elasticsearch cluster, whether in the cloud or on-premises, can lead to stability and scalability issues such as circuit breaker exceptions and degraded search performance caused by unexpected search and indexation load. This is especially problematic for multi-tenancy clusters, where it is hard to predict the varying load patterns that different customers generate each day. Since Elasticsearch 7.10 does not support auto-scaling, adding extra hardware to prepare for peak periods is often not a viable solution because of cost. Instead, it is crucial to proactively identify, mitigate, and prevent these issues.
This blog post details a real cluster overloading issue in multi-tenancy clusters and provides practical tuning tips to optimize cluster resource utilization for a more reliable and predictable customer experience, without increasing cluster processing power. We explain why and when each tip can be applied, as well as the results after each change. Keep in mind that the solution depends on the specific use case: expect an iterative process of trial and error rather than a single magic fix.
Cluster Overloading Troubleshooting: step by step
The investigated cluster has more than 70 indices, but only two of them are responsible for more than 90% of the load. It has 3 data nodes, each with 16 CPU cores and 64GB of memory (32GB for heap). This cluster is characterized by a heavy indexation load and low search traffic, so our focus here was to prevent the high indexation load from destabilizing the entire cluster.
The screenshot below is from a period of extremely high indexation load on one of the data nodes:
The charts show high heap usage caused by the indexing operations running at that moment, such as segment merges. Garbage collections (GC) became more frequent and, consequently, heavier and slower. As a result, CPU and memory usage were very high (almost reaching 100%), triggering many circuit breaker exceptions and leaving the cluster unstable for a while:
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [cluster:monitor/xpack/enrich/coordinator_stats[n]] would be [29367612294/27.3gb], which is larger than the limit of [29071559884/27gb], real usage: [29367612232/27.3gb], new bytes reserved: [62/62b], usages [request=0/0b, fielddata=3089605155/2.8gb, in_flight_requests=6266936/5.9mb, model_inference=0/0b, accounting=668432026/637.4mb]
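For anyone reproducing this kind of diagnosis, the same signals (breaker trips, heap pools, GC timings) can be pulled straight from the cluster APIs instead of dashboards; the calls below are a minimal sketch of what we would run (output field names vary slightly by version):
curl -s "localhost:9200/_nodes/stats/breaker?pretty"   # per-node circuit breaker limits, estimated usage and trip counts
curl -s "localhost:9200/_nodes/stats/jvm?pretty"   # heap pools and GC collection counts/durations
curl -s "localhost:9200/_cat/nodes?v&h=name,heap.percent,cpu"   # quick per-node heap and CPU overview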
After analyzing the above metrics, the first suspect to look at was the GC. We wanted to answer the following questions:
- Why was it so heavy?
- Why was the old-generation collection taking so long to do its job?
- Why was the GC sometimes not very effective despite taking so long to finish (heap usage spikes remained high even after GC)?
We were using JVM version 15, and the default GC configuration (considering the flags set by Elasticsearch via jvm.options) was the following:
./jdk/bin/java -Xms31232m -Xmx31232m -XX:+UseG1GC -XX:G1ReservePercent=25 -XX:InitiatingHeapOccupancyPercent=30 -XX:+PrintFlagsFinal -version | grep -iE "( NewSize | MaxNewSize | OldSize | NewRatio | ParallelGCThreads | MaxGCPauseMillis | ConcGCThreads | G1HeapRegionSize ) "
uint ConcGCThreads = 3 {product} {ergonomic}
size_t G1HeapRegionSize = 16777216 {product} {ergonomic}
uintx MaxGCPauseMillis = 200 {product} {default}
size_t MaxNewSize = 19646119936 {product} {ergonomic}
uintx NewRatio = 2 {product} {default}
size_t NewSize = 1363144 {product} {default}
size_t OldSize = 5452592 {product} {default}
uint ParallelGCThreads = 13 {product} {default}
The sizes of the memory pools seemed odd to us. By default, NewRatio is 2 (roughly 33% of the heap dedicated to the young generation and 66% to the old one), which is fair for Elasticsearch. With our ~30GB heap, that would mean roughly 10GB for the young generation and 20GB for the old one; however, that is not what we saw. Looking at NewSize and MaxNewSize, the young generation pool could range from about 1.3MB to about 19GB, because the JVM had sized it ergonomically (see the Java Platform, Standard Edition HotSpot Virtual Machine Garbage Collection Tuning Guide).
So, our first attempt was to explicitly set NewRatio=2 and also define MaxGCPauseMillis=400 (a sketch of the override file follows the list below).
- By explicitly setting NewRatio to 2, we expected to force the maximum young pool size to 33% of the whole heap.
- We set MaxGCPauseMillis to 400 because the default maximum allowed GC pause was low (200ms), so the JVM could have been triggering GC too frequently in order to keep each pause under 200ms. To reduce the GC frequency, we allowed each pause to take up to 400ms.
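For reference, this is roughly what the override looked like; in Elasticsearch 7.x, custom JVM flags usually go into a dedicated file under config/jvm.options.d/ (the file name below is just an example):
config/jvm.options.d/gc-tuning.options:
-XX:NewRatio=2
-XX:MaxGCPauseMillis=400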
First, we applied the above changes in our internal environment, which has a similar cluster configuration but less indexation load.
Before the GC tuning:
After the GC tuning:
Wow, the GC count and duration were better, and the CPU utilization (mainly the cgroup metric) was much lower. We thought we were good to apply the changes to the production environment 🙂
However, after applying it to the production cluster, the change had the opposite effect. The heap was constantly at its peak and the old-generation GC became much more frequent, probably because setting NewRatio=2 reduced the young pool size. As a consequence, we got lots of circuit breaker exceptions and GC overhead logs.
Then we decided to let the JVM define the young pool size ergonomically again, so we manually set only MaxGCPauseMillis=400. The idea was to reduce the frequency of GCs and make them more efficient. It solved the old-generation collection frequency issue, but the young-generation GC activity became very high again. As a consequence, once more, lots of circuit breakers and GC overhead.
After analyzing the results, we had two important takeaways:
- For clusters with high heap usage, mostly due to heavy indexation load, tuning the GC settings mentioned above is not recommended. In our case, memory usage only increased, generating many circuit breaker errors and causing even more instability in the cluster.
- Tuning the young/old memory pool sizes may be a good idea for clusters with stable heap usage but high CPU utilization, mostly due to high search load.
So, at this moment, we had found an optimization for the search-heavy clusters but still had not found any improvements for the indexation-heavy clusters.
The next attempt was to tune the write thread pool size. The current configuration was 13, and since this is a fixed thread pool, we always had 13 threads available to index documents. Instead of increasing it even more, we decided to reduce the thread pool size to control the indexation load received by the cluster. Of course, we expected to reduce the indexation throughput in some scenarios, but our final goal was to avoid as many circuit breakers as possible. Here are the results (a configuration sketch follows them):
- Write thread pool size = 13:
  - Throughput: 488 documents indexed per second.
  - CPU utilization:
- Write thread pool size = 6:
  - Throughput: 756 documents indexed per second.
  - CPU utilization:
- Write thread pool size = 1:
  - Throughput: 256 documents indexed per second.
  - CPU utilization:
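For reference, the write thread pool size is a static node setting, so a change like the one we tested would typically be applied in elasticsearch.yml on every data node and picked up after a restart; the line below is just a sketch of the value we ended up with:
thread_pool.write.size: 6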
It looks like we had finally found something 🙂
We were surprised to find that the indexation throughput was higher with 6 write threads than with 13 threads. However, the most important finding was the significant difference in CPU utilization between the two configurations. As suspected, the CPU utilization was much higher with 13 write threads. Although the difference between 6 and 1 threads was small, it was still noteworthy.
Based on this experiment, we learned that limiting the size of the write thread pool is one possible solution for controlling the indexation load on the cluster. Our use cases showed that the two best options were 6 and 1 threads. For clusters with a search-heavy load and strict service level agreements (SLAs), it is advisable to choose 1 thread since the CPU usage for indexing purposes will rarely exceed 10%, leaving most of the CPU power for search operations. However, for clusters that require both search and indexation to be available at a normal level, 6 threads seem to be a good choice.
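One caveat worth keeping in mind when shrinking the pool (a general expectation on our part, not something shown in the charts above): with fewer write threads, bulk requests spend more time in the write queue and may eventually be rejected, so it is worth watching the queue and rejection counters, for example with the cat thread pool API:
curl -s "localhost:9200/_cat/thread_pool/write?v&h=node_name,size,queue_size,active,queue,rejected"
A steadily growing rejected count would be the signal that the pool was reduced too aggressively for the incoming indexation traffic.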
The approach to optimizing an ES cluster to prevent overloading will vary depending on the specific use case. Our investigation focused on three main scenarios: clusters with heavy indexation load, clusters with heavy search load, and clusters with strict search service level agreements (SLAs).
For clusters with a heavy indexation load, memory is likely to be the overloaded resource. In this case, reducing the size of the write thread pool is the best solution for managing the load on the cluster. The optimal thread pool sizes for us were 6 and 1, depending on how much processing power should be reserved for searching.
For clusters with a heavy search load and/or strict search SLAs, the CPU is often the bottleneck. In this scenario, two solutions can be implemented in combination. First, tune the JVM garbage collection by explicitly setting the size of the young and old generations of memory (NewRatio) and reducing the frequency of GC pauses (by increasing MaxGCPauseMillis). Second, decrease the size of the write thread pool for the same reasons as in the previous scenario. This will significantly reduce the CPU power used for indexation and prevent indexing operations from heavily impacting searches.
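Putting the two levers together for a search-heavy cluster, the combined configuration would look roughly like this, using the values from our experiments rather than universal recommendations (the file name under jvm.options.d is again just an example):
config/jvm.options.d/gc-tuning.options:
-XX:NewRatio=2
-XX:MaxGCPauseMillis=400
elasticsearch.yml:
thread_pool.write.size: 1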
Useful resources
https://e-mc2.net/blog/elasticsearch-garbage-collection-hell/
https://sematext.com/blog/java-garbage-collection-tuning/
https://www.bookstack.cn/read/elasticsearch-7.9-en/54cfc570be7e7d8f.md