System Performance - Docker Single Host Increasing Interrupts

Hi All,

We have recently started to ramp up the number of log messages going to our Graylog instance. It is certainly not a massive amount, but the system cannot cope, which results in a system crash or the Docker containers requiring a restart. I want to understand how I can find out which component is causing this behaviour and how I can identify the bottleneck.

I assumed it might be disk load, which is why I switched to SSDs on GCE, but that did not reduce the issue.
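One way to test the disk-load assumption is to look at where the CPU time actually goes. The following is a minimal sketch, not part of the original setup, that samples the aggregate `cpu` line of `/proc/stat` on the Docker host twice and reports the share of time per state. The function names and the 5-second interval are illustrative assumptions. A large `iowait` share would point at disk, a large `irq`/`softirq` share at interrupt handling (for example network traffic), and a large `steal` share at contention on the hypervisor.

```python
import time

# Field order of the aggregate "cpu" line in /proc/stat (Linux).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the cumulative tick counters from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_share(before, after):
    """Percentage of elapsed CPU time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}


def sample(interval=5.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart and return the shares."""
    with open(path) as f:
        before = parse_cpu_line(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_cpu_line(f.read())
    return cpu_share(before, after)


# Usage on the Docker host:
# for name, pct in sample().items():
#     print(f"{name:8s} {pct:5.1f}%")
```

Running this while the log volume is ramping up, and again while the system is quiet, should show which state grows with the load.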

Disclaimer: I’m not a system admin; I’ve just thrown this together so we could have a solution.

Google Cloud Compute Instance
1 vCPU, 4.75 GB memory
SSD: 200GB

docker-compose file:
 version: '3.2'
 services:
   # MongoDB: https://hub.docker.com/_/mongo/
   mongo:
     image: mongo:3
     restart: unless-stopped
     volumes:
       - mongo_data:/data/db
     networks:
       - graylog
   # Elasticsearch: https://www.elastic.co/guide/en/elasticsearch/reference/6.x/docker.html
   elasticsearch:
     image: docker.elastic.co/elasticsearch/elasticsearch-oss:6.8.5
     restart: unless-stopped
     volumes:
       - es_data:/usr/share/elasticsearch/data
     environment:
       - http.host=0.0.0.0
       - transport.host=localhost
       - network.host=0.0.0.0
       - "ES_JAVA_OPTS=-Xms1536m -Xmx1536m"
     ulimits:
       memlock:
         soft: -1
         hard: -1
     #mem_limit: 1g
     networks:
       - graylog
   # Graylog: https://hub.docker.com/r/graylog/graylog/
   graylog:
     image: graylog/graylog:3.3
     volumes:
       - graylog_journal:/usr/share/graylog/data/journal
     restart: unless-stopped
     networks:
       - graylog
     depends_on:
       - mongo
       - elasticsearch
     environment:
       GRAYLOG_SERVER_JAVA_OPTS: "-Djavax.net.ssl.trustStore=/usr/share/graylog/data/config/ssl/cacerts.jks"
 
     ports:
       # Graylog https and Rest API
       - 443:443
       #- 127.0.0.1:9000:9000
       # Syslog TCP
       - 514:514
       # Syslog UDP
       - 514:514/udp
       # Syslog UDP Tag Systems
       - 515:515/udp
       # Syslog UDP for Linux Hosts
       - 1514:1514/udp
       # GELF TCP
       - 12201:12201
       # GELF UDP
       - 12201:12201/udp
     logging:
       driver: "json-file"
 
     volumes:
       # Mount local configuration directory into Docker container
       - ./graylog/config:/usr/share/graylog/data/config
       # Mount GEO Database DIR
       - ./graylog/geoip:/usr/share/graylog/data/geoip
       # Mount local plugin files into Docker container
       #- ./graylog/plugin/graylog-plugin-auth-sso-3.3.0.jar:/usr/share/graylog/plugin/graylog-plugin-auth-sso-3.3.0.jar
       - ./graylog/plugin/graylog-plugin-enterprise-integrations-3.3.7.jar:/usr/share/graylog/plugin/graylog-plugin-enterprise-integrations-3.3.7.jar
       - ./graylog/plugin/graylog-plugin-integrations-3.3.7.jar:/usr/share/graylog/plugin/graylog-plugin-integrations-3.3.7.jar
       - ./graylog/plugin/graylog-plugin-enterprise-3.3.7.jar:/usr/share/graylog/plugin/graylog-plugin-enterprise-3.3.7.jar
       # Mount local Graylog Enterprise binary files into Docker container
       - ./graylog/bin/chromedriver:/usr/share/graylog/bin/chromedriver
       - ./graylog/bin/chromedriver_start.sh:/usr/share/graylog/bin/chromedriver_start.sh
       - ./graylog/bin/headless_shell:/usr/share/graylog/bin/headless_shell
 
 volumes:
   mongo_data:
     driver: local
   es_data:
     driver: local
   graylog_journal:
     driver: local
 
 networks:
   graylog:

Docker container stats:
CONTAINER ID   NAME                      CPU %   MEM USAGE / LIMIT     MEM %    NET I/O           BLOCK I/O         PIDS
5ea64500913e   graylog_elasticsearch_1   3.84%   1.946GiB / 4.594GiB   42.35%   3.58GB / 1.02GB   2.09GB / 9.73GB   43
7aa916c45374   graylog_graylog_1         1.50%   1.475GiB / 4.594GiB   32.11%   6.16GB / 6.81GB   244MB / 1.45GB    227
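
To put those memory figures in context, here is a small sketch that adds up the two containers listed and compares the total with the host limit that `docker stats` reports. It only does arithmetic on the numbers above; the helper names are illustrative, and the MongoDB container is left out because it does not appear in the pasted output. Whether memory is the actual bottleneck is an assumption to verify, not a conclusion.

```python
import re

# Multipliers for the binary units that `docker stats` prints.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3}

# Rows copied from the stats output above: name, usage, limit.
ROWS = [
    ("graylog_elasticsearch_1", "1.946GiB", "4.594GiB"),
    ("graylog_graylog_1", "1.475GiB", "4.594GiB"),
]


def to_bytes(text):
    """Convert a size such as '1.946GiB' to a number of bytes."""
    match = re.fullmatch(r"([\d.]+)\s*(B|KiB|MiB|GiB)", text.strip())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]


def memory_summary(rows):
    """Return (used GiB, host GiB, headroom GiB) for the listed containers."""
    used = sum(to_bytes(usage) for _, usage, _ in rows)
    host = to_bytes(rows[0][2])  # every row reports the same host limit
    gib = UNITS["GiB"]
    return used / gib, host / gib, (host - used) / gib


used, host, free = memory_summary(ROWS)
print(f"containers: {used:.3f} GiB of {host:.3f} GiB ({100 * used / host:.1f}%)")
print(f"left for MongoDB, the OS and page cache: {free:.3f} GiB")
```

With these numbers the two containers already hold roughly three quarters of the host's memory, so the remainder has to cover MongoDB, the operating system, and the file system cache that Elasticsearch relies on.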

Stats from host system:

Config of Graylog and Stats

Inputs:

GELF UDP

linux-syslog

syslog

tag-syslog

Let me know if you need any further information to point me in the right direction :slight_smile:

@tomehb I’ve gone through and edited your docker-compose file to make it a bit more readable (protip: surrounding your code with ``` will help with this). Can you let me know if this is accurate? I just want to make sure before going down the road of troubleshooting.

Hi Aaron,

Thanks - just reviewed and can confirm this is correct. I have since upgraded to 4.0, but other than the versions there are no differences. (The issue is still present, as I expected.)

Thanks
Thomas

Hi @tomehb.

I’d be curious to see what the docker logs say. In general, Graylog is going to be more RAM- and CPU-intensive than disk-intensive, so throwing an SSD at it, while fine, won’t get you what I think you want. I’m out today through the beginning of the week, so I won’t be able to devote time to trying to replicate this, but I have a suspicion that you might need to throw another vCPU on that VM. As a side note, if you’re running a VM, it may be worth your time to just install Graylog directly on it rather than adding what I perceive to be unnecessary complexity by throwing Docker into the mix.

Hi Aaron,

Noted - I’ll add another vCPU, see if it changes anything, and report back.

The main reason for running in Docker is so I can easily spin it up off-cloud from my backups. I may look to change this as you recommended, but I’ll add another vCPU and make any other suggested changes first :slight_smile:

Thanks
Thomas

Hi @tomehb

I’m running in Docker Swarm okay: 4 vCPU, 16GB RAM.
~65 containers, and I haven’t seen more than 600K log events in a day. All Graylog containers are on one node, which also has other containers scheduled on it.

HTH

@cakiwi Thanks for the Info.

@aaronsachs
I’m now running with 2 vCPU. The issue does not appear to have improved.

Do you think I should bump to 4 vCPU?

New Graphs:

@tomehb are you still running it in a Docker container?

@aaronsachs - yes, currently. I assume you think I should ditch that straight away?

Hmmmm, I think it might be premature. What are your docker logs saying? There should at the very least be some logs for the containers.
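
If the raw log output is too much to read through, a rough sketch like the one below could give a first summary. It pulls the last lines of a container's log with `docker logs --tail` and counts lines containing a few marker strings. The marker list and the function names are assumptions chosen for illustration, not something Graylog or Docker prescribes, so adjust them to whatever the logs actually contain.

```python
import subprocess
from collections import Counter

# Assumption: these markers are worth counting; adjust to what the logs show.
MARKERS = ("ERROR", "WARN", "OutOfMemoryError")


def count_markers(log_text, markers=MARKERS):
    """Count how many log lines contain each marker string."""
    counts = Counter()
    for line in log_text.splitlines():
        for marker in markers:
            if marker in line:
                counts[marker] += 1
    return counts


def container_logs(name, tail=2000):
    """Fetch the last `tail` log lines of a container via the docker CLI."""
    result = subprocess.run(
        ["docker", "logs", "--tail", str(tail), name],
        capture_output=True, text=True, check=True,
    )
    # Containers may write to either stream, so keep both.
    return result.stdout + result.stderr


# Usage on the Docker host (container names taken from the stats output):
# for name in ("graylog_graylog_1", "graylog_elasticsearch_1"):
#     print(name, dict(count_markers(container_logs(name))))
```

The counts only tell you where to start reading; the lines around the first error after a restart are usually the ones worth posting here.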
