Graylog Master host freezes from time to time

My setup:

  • Graylog: v2.4.6 (running in Docker) - 4 nodes (16 vCPUs / 14.4 GB memory each)
  • ES: 5.6.1 - 3 nodes

A couple of times per day the Graylog master node “freezes”: the web UI becomes very slow and searches are not possible at all. “Process buffer” and “Output buffer” go to 100%, and “Disk Journal Utilization” constantly increases. This state lasts for several minutes (sometimes 15-30 minutes). Sometimes I have to restart the application to make the Graylog master node process messages again.

The slave nodes don’t have this problem; only the master node seems to suffer from this congestion.

Relevant configuration section from docker-compose.yml

  ...
  environment:
    GRAYLOG_SERVER_JAVA_OPTS: "-Xms8g -Xmx8g -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"
    GRAYLOG_STREAM_PROCESSING_TIMEOUT: 3000
    GRAYLOG_STREAM_PROCESSING_MAX_FAULTS: 4
    GRAYLOG_PROCESSBUFFER_PROCESSORS: 20
    GRAYLOG_OUTPUTBUFFER_PROCESSORS: 12
    GRAYLOG_OUTPUTBUFFER_PROCESSOR_THREADS_MAX_POOL_SIZE: 64
    GRAYLOG_OUTPUT_BATCH_SIZE: 1000
    GRAYLOG_IS_MASTER: "true"
  ...

The CPU usage looks like this: suddenly the CPU usage drops close to 0, and after some time it rises again (because the backlog of old messages is being processed). You can see the pattern in the attached screenshot.

At about 17:30 it’s really bad: it dropped for about 30 minutes.

  GRAYLOG_PROCESSBUFFER_PROCESSORS: 20
  GRAYLOG_OUTPUTBUFFER_PROCESSORS: 12

The inputbuffer, processbuffer, and outputbuffer processors together should not exceed the number of available cores on the system. I guess that this freeze happens during a GC run - your log file might tell you.
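On a 16 vCPU node that could look roughly like the sketch below (just an illustration of “not more than the available cores”; the exact split between the three buffers depends on how heavy your extractors and pipelines are):

  environment:
    # 2 + 10 + 4 = 16, i.e. no more buffer processors than vCPUs
    GRAYLOG_INPUTBUFFER_PROCESSORS: 2
    GRAYLOG_PROCESSBUFFER_PROCESSORS: 10
    GRAYLOG_OUTPUTBUFFER_PROCESSORS: 4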

Regarding the freezes: what would I have to look for in the log file?

These *_processors settings seem to be some kind of voodoo black magic :wink:

E.g. I found this here, where they use

processbuffer_processors = 72
outputbuffer_processors = 128

for a 16-core machine.

“Increasing the output batch size negatively impacted performance, so we reduced the batch size to 100, which is 1/10th of the default value. At the same time we increased the processbuffer_processors and outputbuffer_processors quite a lot.” So the current settings, which seem to be a sweet spot, are really the opposite of the best practices.

Regarding the freezes: what would I have to look for in the log file?

You need to watch for garbage collection messages (e.g. long GC pauses) or similar.
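If you want the GC pauses to show up explicitly, you can also append GC logging flags to the JVM options (a sketch for the Java 8 / CMS setup you already run; only a few of your existing flags are repeated here, and the log path inside the container is just an example):

  environment:
    # Keep your existing options and add the GC logging flags at the end;
    # long "Total time for which application threads were stopped" entries
    # can then be correlated with the times the node freezes.
    GRAYLOG_SERVER_JAVA_OPTS: >-
      -Xms8g -Xmx8g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC
      -Xloggc:/tmp/graylog-gc.log
      -XX:+PrintGCDetails -XX:+PrintGCDateStamps
      -XX:+PrintGCApplicationStoppedTime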

These *_processors settings seem to be some kind of voodoo black magic

It is not - each processor creates a new thread in the JVM that is used to process messages. With the default settings, ~200 threads are created in the JVM for various tasks. If you put 128 processors for the output buffer into this game, you also get 128 possible connections from Graylog to Elasticsearch. That will overwhelm Elasticsearch with connections and leave no JVM threads for the other workers that are needed for other functions!

The point is, you want big batches (output_batch_size) over not too many connections (outputbuffer_processors) - sending small chunks over hundreds of connections increases the resource demand on Elasticsearch.
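In config terms that trade-off looks roughly like this (a sketch, not a tuned recommendation for your load):

  environment:
    # Few connections to Elasticsearch, each sending large bulk requests,
    # instead of e.g. 128 processors each shipping tiny batches of 100.
    GRAYLOG_OUTPUTBUFFER_PROCESSORS: 3
    GRAYLOG_OUTPUT_BATCH_SIZE: 1000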

Do it the way you think is right - I’ve told you mine.
