Championing Graylog and need performance advice

Hey everyone. I was hoping to get some advice regarding lackluster performance with my current setup. I have set up a Graylog cluster with two Graylog nodes running in Docker (CE) Swarm mode and two ES nodes running on RHEL 7 (a third ES node and a third Swarm node are on their way; provisioning just got delayed).

To make this post easier to read, I'll put the details at the bottom. Basically, under light load (0-100 messages/sec) everything runs great: queries are fast, input buffers stay at zero, and the Kafka journal stays below 500 messages. Under heavier load (200-500 messages/sec), things start to go awry. I have seen the output buffer go to 100% and restart the Graylog server, and the disk journal filled to 200,000+ messages before I shut off the input. I am really just trying to get a grip on how to properly diagnose what's going on. Any help would be greatly appreciated.

From what I have observed, our system occasionally sends large log messages (around 10 MB max). I currently believe this is the culprit and that it is putting significant back pressure on the output buffer. I have code going out soon to limit the size of these logs. To be 100% honest, I'm not really sure what the desired/nominal ranges are for many of these metrics. A great addition to the docs would be expected nominal metric values for common loads. Any input on this would also be very much appreciated.
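In case it helps with diagnosis, this is roughly how I have been polling the journal and the output metrics while reproducing the problem (REST API paths as I understand them from the 2.3 API browser; host and credentials are specific to my setup):

# Journal utilization (uncommitted entries, journal size, etc.)
curl -s -u admin:**redacted** http://graylog-master:9000/api/system/journal

# A single metric by name, e.g. the ES output batch timing shown further down
curl -s -u admin:**redacted** "http://graylog-master:9000/api/system/metrics/org.graylog2.outputs.BlockingBatchedESOutput.processTime"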

Input Config
1 - Global GELF UDP Input on 12201 (1 on master, 1 on slave)
Messages are being sent as JSON over GELF UDP. Using a JSON extractor to parse and create all fields.
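For context, an illustrative GELF payload (not our real schema) looks roughly like this; the JSON in short_message is what the extractor flattens into fields such as requestContent:

{
  "version": "1.1",
  "host": "app-server-01",
  "short_message": "{\"msg\":\"HTTP request completed\",\"requestContent\":\"...up to ~10 MB in the worst case...\"}",
  "timestamp": 1509000000.0,
  "level": 6
}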

Metrics of Concern

org.graylog2.outputs.BlockingBatchedESOutput.batchSize (Histogram)
95th percentile: 19
98th percentile: 23
99th percentile: 26
Standard deviation: 5
Mean: 4
Minimum: 1
Maximum: 50
Count: 12,971

org.graylog2.outputs.BlockingBatchedESOutput.bufferFlushes (Meter)
Total: 12,971 events
Mean: 0.31 events/second
1 minute avg: 0.41 events/second
5 minute avg: 0.4 events/second
15 minute avg: 0.41 events/second

org.graylog2.outputs.BlockingBatchedESOutput.bufferFlushesRequested (Meter)
Total: 14,886 events
Mean: 0.36 events/second
1 minute avg: 0.7 events/second
5 minute avg: 0.7 events/second
15 minute avg: 0.69 events/second

org.graylog2.outputs.BlockingBatchedESOutput.bufferFlushFailures (Meter)
Total: 0 events
Mean: 0 events/second
1 minute avg: 0 events/second
5 minute avg: 0 events/second
15 minute avg: 0 events/second

org.graylog2.outputs.BlockingBatchedESOutput.processTime (Timer)
95th percentile: 264,434μs
98th percentile: 463,344μs
99th percentile: 463,344μs
Standard deviation: 97,894μs
Mean: 64,884μs
Minimum: 5,827μs
Maximum: 1,555,025μs


Custom Mapping - Prevent 32 KB ES Limit Error

curl -X PUT http://localhost:9200/_template/graylog-custom-mapping?pretty -d '
{
  "template": "graylog_*",
  "mappings" : {
    "message" : {
      "properties" : {
        "requestContent" : {
          "type" : "string",
          "index" : "no",
          "doc_values": false
        }
      }
    }
  }
}'
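To double-check that the template actually landed, it can be read back with the request below. Note that, as far as I know, it only applies to indices created after the template exists, i.e. after the next index rotation.

curl -X GET http://localhost:9200/_template/graylog-custom-mapping?pretty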

Plugins
1 - Forked Slack Plugin by aenima4six2 (me).
2 - Aggregate Plugin
3 - HTTP Plugin

Graylog/Docker Setup (Note: a 3rd ES server is being provisioned soon to avoid split brain)

version: '3.1'
services:
  mongodb-1:
    image: **mongo:latest based image**
    volumes:
      - ./data/mongodb-1:/data/db
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.db.mongo == mongodb-1
  mongodb-2:
    image: **mongo:latest based image**
    volumes:
      - ./data/mongodb-2:/data/db
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.db.mongo == mongodb-2
  mongodb-arbiter:
    image: **mongo:latest based image**
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.db.mongo == mongodb-arbiter

  # MongoDB Replica Init Container
  mongodb-init:
    image: **mongo:latest based image**
    depends_on:
      - mongodb-1
      - mongodb-2
      - mongodb-arbiter
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux

  graylog-master:
    image: **graylog2/server:2.3.1-1 based image**
    environment:
      GRAYLOG_SERVER_JAVA_OPTS: '-Xms4g -Xmx4g -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow'
      GRAYLOG_PASSWORD_SECRET: **redacted**
      GRAYLOG_ROOT_PASSWORD_SHA2: **redacted**
      GRAYLOG_WEB_LISTEN_URI: http://0.0.0.0:9000/
      GRAYLOG_REST_LISTEN_URI: http://0.0.0.0:9000/api/ 
      GRAYLOG_WEB_ENDPOINT_URI: http://graylog-master:9000/api/
      GRAYLOG_REST_TRANSPORT_URI: http://graylog-master:9000/api/
      GRAYLOG_REST_ENABLE_TLS: 'false'
      GRAYLOG_WEB_ENABLE_TLS: 'false'
      GRAYLOG_MONGODB_URI: mongodb://mongodb-1:27017,mongodb-2:27017/graylog?replicaSet=graylog
      GRAYLOG_ELASTICSEARCH_SHARDS: 6
      GRAYLOG_ELASTICSEARCH_REPLICAS: 1
      GRAYLOG_ELASTICSEARCH_HOSTS: 'http://**REDACTED**-1:9200,http://**REDACTED**-2:9200'
      GRAYLOG_IS_MASTER: 'true'
      GRAYLOG_WEB_ENABLE: 'true'
    volumes:
      - ./data/master/journal:/usr/share/graylog/data/journal
    ports:
      - "12201:12201/udp"
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.app.graylog == graylog-master
    depends_on:
      - mongodb-1
      - mongodb-2
      - mongodb-arbiter

  graylog-slave:
    image: **graylog2/server:2.3.1-1 based image**
    entrypoint: /wait-for-it.sh graylog-master:9000 -t 60 -- /docker-entrypoint.sh graylog
    environment:
      GRAYLOG_SERVER_JAVA_OPTS: '-Xms4g -Xmx4g -XX:NewRatio=1 -XX:MaxMetaspaceSize=256m -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow'
      GRAYLOG_PASSWORD_SECRET: **redacted**
      GRAYLOG_ROOT_PASSWORD_SHA2: **redacted**
      GRAYLOG_WEB_LISTEN_URI: http://0.0.0.0:9000/
      GRAYLOG_REST_LISTEN_URI: http://0.0.0.0:9000/api/ 
      GRAYLOG_WEB_ENDPOINT_URI: http://graylog-slave:9000/api/
      GRAYLOG_REST_TRANSPORT_URI: http://graylog-slave:9000/api/
      GRAYLOG_REST_ENABLE_TLS: 'false'
      GRAYLOG_WEB_ENABLE_TLS: 'false'
      GRAYLOG_MONGODB_URI: mongodb://mongodb-1:27017,mongodb-2:27017/graylog?replicaSet=graylog
      GRAYLOG_ELASTICSEARCH_SHARDS: 6
      GRAYLOG_ELASTICSEARCH_REPLICAS: 1
      GRAYLOG_ELASTICSEARCH_HOSTS: 'http://**REDACTED**-1:9200,http://**REDACTED**-2:9200'
      GRAYLOG_IS_MASTER: 'false'
      GRAYLOG_WEB_ENABLE: 'true'
    volumes:
      - ./data/slave/journal:/usr/share/graylog/data/journal
    ports:
      - "12202:12201/udp"
    deploy:
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
          - node.labels.app.graylog == graylog-slave
    depends_on:
      - mongodb-1
      - mongodb-2
      - mongodb-arbiter
  nginx:
    image: **nginx:latest based image**
    deploy:
      mode: replicated
      replicas: 2
      restart_policy:
        condition: on-failure
      placement:
        constraints:
          - node.role == worker
          - node.labels.os_type == linux
    depends_on:
      - graylog-slave
      - graylog-master
    ports:
      - "12200:12200/tcp"
      - "9000:9000"

Graylog Config - Comments Removed

node_id_file = /usr/share/graylog/data/config/node-id
plugin_dir = /usr/share/graylog/plugin
rest_listen_uri = http://0.0.0.0:9000/api/
rest_enable_cors = true
web_listen_uri = http://0.0.0.0:9000/
web_enable_cors = true
elasticsearch_hosts = http://elasticsearch:9200
elasticsearch_compression_enabled = false
allow_leading_wildcard_searches = true
allow_highlighting = false
output_batch_size = 5000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 10
outputbuffer_processors = 10
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /usr/share/graylog/data/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://mongo/graylog
mongodb_max_connections = 100
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_loader_enabled = true
content_packs_dir = /usr/share/graylog/data/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

Graylog Server Specs - x2
Type: VM (VMWare)
Container Service: Docker CE - Swarm Mode
OS: RHEL 7
CPU: 8 Core
Memory: 8 GB
Disk: 350 Mbps write / 1,200 Mbps read SAN

Elasticsearch Server Specs - x2
Type: VM (VMWare)
Container Service: None
OS: RHEL 7
CPU: 4 Core
Memory: 15 GB
Disk: 350 Mbps write / 1,200 Mbps read SAN
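On the ES side, the obvious things to watch seem to be cluster health and bulk thread pool rejections, roughly like this (syntax is for the ES 5.x cat API as I understand it; adjust for 2.x; hosts redacted as above):

curl -s http://**REDACTED**-1:9200/_cluster/health?pretty
curl -s "http://**REDACTED**-1:9200/_cat/thread_pool/bulk?v&h=node_name,active,queue,rejected"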

What is your configured Java heap for Graylog and Elasticsearch?

In addition, you should lower outputbuffer_processors, because if Elasticsearch takes longer than one second to successfully store the messages, the maximum number of connections to Elasticsearch will be reached very quickly when you ingest with two Graylog servers.

And you should lower the number of cores for Graylog but raise the number of cores for Elasticsearch.
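As a rough starting point only (the exact numbers depend on your hardware and need testing under your real load), something in this direction, which should match the shipped defaults:

processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2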

Thanks for taking a look at this @jan!

Both are about 50% of physical memory.
Graylog is -Xms4g -Xmx4g and ES is -Xms7g -Xmx7g

The rationale for increasing the buffer processors came from the config docs:

The number of parallel running processors.
Raise this number if your buffers are filling up.
processbuffer_processors = 10
outputbuffer_processors = 10

I will drop outputbuffer_processors back to its default.

We also allocated CPU and memory based on the docs:
Graylog nodes should have a focus on CPU power. These also serve the user interface to the browser.
Elasticsearch nodes should have as much RAM as possible and the fastest disks you can get. Everything depends on I/O speed here.

Are these recommendations in the docs incorrect? Would you suggest 4 cores each on GL and 8 cores on ES?

hej @Aenima4six2

It is not easy to say whether the recommendation in the documentation is wrong. Every setup is different, and that is just a recommendation, not a hard rule. It is only meant to give you an idea of where the focus of each part of a Graylog setup should be.

For a detailed analysis of your setup, you might consider professional services, which can help you in far more detail.

Hey @jan. That's fair. Professional services/Enterprise is a possibility if the current initiative to champion Graylog is successful. It's a bit of a chicken-and-egg situation, but I won't be able to justify spending budget on pro services unless I can create some demand first with the base product.

Also, do you recommend increasing ring_size, say to 131072?

That depends on a lot of variables, e.g. the number of processing threads, the size of the messages, the amount of memory, etc.

@jochen I have most of that documented.

Large messages. 8 GB per server.

What exactly are “large” messages in numbers?

@jochen [quote=“Aenima4six2, post:1, topic:2309”]
around 10Mb max
[/quote]

The average message is pretty small; however, we log request bodies, and in some instances they are up to 10 MB. I have code going out soon to limit the size of our logged request bodies to 4,096 characters.

So in the worst case you have a (naively calculated) memory requirement of $ring_size * $max_message_size for each ring buffer (without including any overhead due to the memory structures of a Java object).

If you set ring_size to 131072, that would be 131072 * 10 MB == 1,310,720 MB (or roughly 1.3 TB) in the worst case. Of course you can calculate with average sizes, but then you have to live with the fact that Graylog will stop working if the worst case occurs.
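For comparison, assuming the planned 4,096-character cap (roughly 4 KB per message), the same naive calculation gives 131072 * 4 KB == 524,288 KB (or roughly 512 MB) per ring buffer, again ignoring Java object overhead.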

