Graylog Master host freezes from time to time

hangstl · October 11, 2018, 1:10pm

My setup:

Graylog: v2.4.6 (running in Docker) - 4 nodes (16 vCPUs / 14.4 GB memory each)
ES: 5.6.1 - 3 nodes

A couple of times per day the Graylog master node “freezes”: the web UI becomes very slow and searches are not possible at all. “Process buffer” and “Output buffer” go to 100%, and “Disk Journal Utilization” increases constantly. This state lasts for some minutes (sometimes 15-30 minutes). Sometimes I have to restart the application to make the Graylog master node process messages again.

The slave nodes don’t have this problem; only the master node seems to suffer from this congestion.

Relevant configuration section from docker-compose.yml

  ...
  environment:
    GRAYLOG_SERVER_JAVA_OPTS: "-Xms8g -Xmx8g -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"

    GRAYLOG_STREAM_PROCESSING_TIMEOUT: 3000
    GRAYLOG_STREAM_PROCESSING_MAX_FAULTS: 4
    GRAYLOG_PROCESSBUFFER_PROCESSORS: 20
    GRAYLOG_OUTPUTBUFFER_PROCESSORS: 12
    GRAYLOG_OUTPUTBUFFER_PROCESSOR_THREADS_MAX_POOL_SIZE: 64
    GRAYLOG_OUTPUT_BATCH_SIZE: 1000
    GRAYLOG_IS_MASTER: "true"
  ...

The CPU usage follows this pattern: it suddenly drops close to 0, and after some time it rises again (because the backlog of old messages is being processed). You can see the pattern in the attached screenshot.

At about 17:30 it was really bad: the CPU usage stayed down for about 30 minutes.

jan · October 11, 2018, 5:23pm

  GRAYLOG_PROCESSBUFFER_PROCESSORS: 20
  GRAYLOG_OUTPUTBUFFER_PROCESSORS: 12

The input, process and output buffer processors together should not add up to more than the number of available cores on a system. I guess that this freeze happens during a GC run; your logfile might tell you.
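As a rough sanity check of that rule, here is a small sketch that adds up the three processor settings and compares the sum with the core count. The helper name `check_processor_budget` is made up for illustration, and the `inputbuffer_processors` value of 2 is an assumption, since the configuration above does not show it.

```python
def check_processor_budget(cores, inputbuffer, processbuffer, outputbuffer):
    """Return (total, over): the summed buffer processors and how many
    of them exceed the number of available cores (0 if within budget)."""
    total = inputbuffer + processbuffer + outputbuffer
    return total, max(0, total - cores)


# processbuffer/outputbuffer values are taken from the docker-compose.yml above;
# inputbuffer=2 is an assumed value, it is not shown in the original post.
total, over = check_processor_budget(
    cores=16, inputbuffer=2, processbuffer=20, outputbuffer=12
)
print(f"{total} buffer processors on 16 cores, {over} over budget")
```

With the numbers from the original post, the sum is well above the 16 vCPUs of each node, so lowering the process and output buffer processors would be the first thing to try.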

hangstl · October 12, 2018, 7:24am

Regarding the freezes: What would I have to look for in the logfile?

These *_processors settings seem to be some kind of voodoo black magic.

E.g. I found this here, where they use

processbuffer_processors = 72
outputbuffer_processors = 128

for a 16 core machine.

“Increasing the output batch size did negatively impact the performance so we reduced the batch size to 100, which is 1/10th of the default value. At the same time we increased the processbuffer_processors and outputbuffer_processors quite a lot. So the current settings, that seem to be a sweet spot are really the opposite to the best practices.”

jan · October 12, 2018, 10:03am

Regarding the freezes: What would I have to look for in the logfile?

You need to look for garbage collection messages or similar.

These *_processors settings seem to be some kind of voodoo black magic.

It is not. Each processor creates a new thread in the JVM that is used to process the messages. With the default settings, ~200 threads are created in the JVM for various tasks. If you throw 128 output buffer processors into this game, you also have 128 possible connections from Graylog to Elasticsearch. That will overwhelm Elasticsearch with connections and leave no JVM threads for the other workers that are needed for other functions!

The point is, you want big packages (batch_size) over not too many connections (outputbuffer_processors). Sending small chunks over hundreds of connections increases the resources Elasticsearch needs.
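To put rough numbers on that, the number of bulk requests Elasticsearch has to handle per second is approximately the message rate divided by the batch size. This is only a back-of-the-envelope sketch: the rate of 10,000 messages per second is a made-up figure, `bulk_requests_per_second` is a hypothetical helper and not part of Graylog, and it assumes every batch is sent full.

```python
import math


def bulk_requests_per_second(messages_per_second, batch_size):
    """Estimate bulk requests per second, assuming every batch is full."""
    return math.ceil(messages_per_second / batch_size)


RATE = 10_000  # hypothetical ingest rate, not taken from this thread

for batch_size in (1000, 100):
    requests = bulk_requests_per_second(RATE, batch_size)
    print(f"batch_size={batch_size}: about {requests} bulk requests per second")
```

Under these assumptions, cutting the batch size to a tenth means ten times as many requests for the same volume of messages, each with its own overhead on the Elasticsearch side.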

Do what you think is the right way - I told you mine.
