- Graylog: v2.4.6 (running in Docker) - 4 nodes (16 vCPUs / 14.4 GB memory each)
- ES: 5.6.1 - 3 nodes
A couple of times per day the Graylog master node “freezes”: the web UI becomes very slow and searches are not possible at all. “Process buffer” and “Output buffer” go to 100%, and “Disk Journal Utilization” increases constantly. This state lasts for several minutes (sometimes 15-30 minutes). Sometimes I have to restart the application to make the Graylog master node process messages again.
The slave nodes don’t have this problem; only the master node seems to suffer from this congestion.
Relevant configuration section from docker-compose.yml:
...
environment:
  GRAYLOG_SERVER_JAVA_OPTS: "-Xms8g -Xmx8g -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"
  GRAYLOG_STREAM_PROCESSING_TIMEOUT: 3000
  GRAYLOG_STREAM_PROCESSING_MAX_FAULTS: 4
  GRAYLOG_PROCESSBUFFER_PROCESSORS: 20
  GRAYLOG_OUTPUTBUFFER_PROCESSORS: 12
  GRAYLOG_OUTPUTBUFFER_PROCESSOR_THREADS_MAX_POOL_SIZE: 64
  GRAYLOG_OUTPUT_BATCH_SIZE: 1000
  GRAYLOG_IS_MASTER: "true"
...
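To put the settings above in relation to the hardware, here is a small sketch that compares the configured buffer processors with the 16 vCPUs each node has. It uses only numbers already given in this post. The function name `processor_threads_per_vcpu` is made up for illustration, and any processor settings hidden behind the `...` in the excerpt are not counted. Whether this ratio has anything to do with the freezes is not established here.

```python
# Values copied from this post: 16 vCPUs per node and the processor
# settings from the docker-compose.yml excerpt. Settings elided by "..."
# in the excerpt are unknown and therefore not included.
VCPUS_PER_NODE = 16
PROCESSBUFFER_PROCESSORS = 20
OUTPUTBUFFER_PROCESSORS = 12


def processor_threads_per_vcpu(process: int, output: int, vcpus: int) -> float:
    """Return the number of configured buffer processor threads per vCPU."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return (process + output) / vcpus


ratio = processor_threads_per_vcpu(
    PROCESSBUFFER_PROCESSORS, OUTPUTBUFFER_PROCESSORS, VCPUS_PER_NODE
)
total = PROCESSBUFFER_PROCESSORS + OUTPUTBUFFER_PROCESSORS
print(f"{total} processor threads on {VCPUS_PER_NODE} vCPUs "
      f"= {ratio:.1f} threads per vCPU")
```

With the values from this setup, the sketch reports 32 processor threads on 16 vCPUs, i.e. 2.0 threads per vCPU.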
The CPU usage shows the same pattern: it suddenly drops close to 0, and after some time it rises again (because the backlog of old messages is being processed). You can see the pattern in the attached screenshot.
At about 17:30 it’s really bad: it dropped for about 30 minutes.