Graylog 2.4 Cluster hangs up

1. Describe your incident:
Every Thursday at 11 PM IST, CPU usage on all Graylog servers peaks and then gradually returns to normal. This happens in the 11:02 - 11:05 window.

  • At that time the Graylog GUI cannot be accessed.
  • Utilisation and service metrics are not recorded in Grafana.
  • When logged in, the server is in a hung state; no commands (e.g. top) can be executed to investigate the issue.

2. Describe your environment:

  • OS Information: Centos 7

  • Package Version:

    • Graylog 2.4.3
    • ES Version 5.6.0
    • Mongo 4.2.12
  • Service logs, configurations, and environment variables:

    • 200 messages/sec and 15-20 GB of logs per day

    • GLog server 1 :
      CPU - 16
      RAM - 64 GB

    • GLog server 2 :
      CPU - 16
      RAM - 34 GB

    • GLog server 3 :
      CPU - 16
      RAM - 34 GB

3. What steps have you already taken to try and solve the problem?

Checked disk space and increased drive space by 500 GB on each server.

Increased the Elasticsearch JVM heap from 1 GB to 4 GB.

Looking at the MongoDB logs, I can see the replica set going down, but I have no idea what causes it.

MongoDB logs from Graylog server 2:

2022-03-11T00:36:33.040+0700 I  REPL     [replexec-38556] Member ncc-prod-glog:27017 is now in state RS_DOWN - Request 65639798 timed out, deadline was 2022-03-11T00:32:06.263+0700, op was RemoteCommand 65639798 -- target:[ncc-prod-glog:27017] db:admin expDate:2022-03-11T00:32:06.263+0700 cmd:{ replSetHeartbeat: "rs0", configVersion: 3, hbv: 1, from: "ncc-prod-glogdb1:27017", fromId: 0, term: 19 }
2022-03-11T00:36:33.040+0700 I  REPL     [replexec-38559] Member ncc-prod-glogdb2:27017 is now in state RS_DOWN - no response within election timeout period
2022-03-11T00:36:33.040+0700 I  REPL     [replexec-38559] can't see a majority of the set, relinquishing primary
2022-03-11T00:36:33.040+0700 I  REPL     [replexec-38559] Stepping down from primary in response to heartbeat
2022-03-11T00:36:33.040+0700 I  REPL     [RstlKillOpThread] Starting to kill user operations
2022-03-11T00:36:33.040+0700 I  REPL     [RstlKillOpThread] Stopped killing user operations
2022-03-11T00:36:33.040+0700 I  REPL     [RstlKillOpThread] State transition ops metrics: { lastStateTransition: "stepDown", userOpsKilled: 0, userOpsRunning: 0 }
2022-03-11T00:36:33.040+0700 I  REPL     [replexec-38559] transition to SECONDARY from PRIMARY
2022-03-11T00:36:33.047+0700 I  CONNPOOL [Replication] Ending connection to host ncc-prod-glog:27017 due to bad connection status: CallbackCanceled: Callback was canceled; 0 connections to that host remain open
2022-03-11T00:36:33.047+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:36:33.441+0700 I  NETWORK  [conn1949] end connection 10.100.144.46:56254 (37 connections now open)
2022-03-11T00:36:39.159+0700 I  NETWORK  [conn1944] end connection 10.100.144.41:33948 (36 connections now open)
2022-03-11T00:36:39.159+0700 I  NETWORK  [conn1938] end connection 10.100.144.41:33938 (35 connections now open)
2022-03-11T00:36:40.061+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:36:43.736+0700 I  ELECTION [replexec-38556] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2022-03-11T00:36:47.077+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:36:53.040+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glogdb2:27017
2022-03-11T00:36:54.093+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:36:54.412+0700 I  ELECTION [replexec-38556] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2022-03-11T00:36:55.540+0700 I  REPL_HB  [replexec-38559] Heartbeat to ncc-prod-glog:27017 failed after 2 retries, response status: NetworkInterfaceExceededTimeLimit: Couldn't get a connection within the time limit
2022-03-11T00:37:00.053+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glogdb2:27017
2022-03-11T00:37:01.109+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:37:05.351+0700 I  ELECTION [replexec-38559] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majority (mask 0x1)
2022-03-11T00:37:07.069+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glogdb2:27017
2022-03-11T00:37:08.125+0700 I  CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:37:10.647+0700 I  NETWORK  [listener] connection accepted from 10.100.144.47:43276 #1962 (36 connections now open)
2022-03-11T00:37:10.648+0700 I  NETWORK  [conn1962] received client metadata from 10.100.144.47:43276 conn1962: { driver: { name: "NetworkInterfaceTL", version: "4.2.12" }, os: { type: "Linux", name: "CentOS Linux release 7.9.2009 (Core)", architecture: "x86_64", version: "Kernel 3.10.0-1160.31.1.el7.x86_64" } }
2022-03-11T00:37:10.690+0700 I  NETWORK  [conn1937] end connection 10.100.144.47:21884 (35 connections now open)
2022-03-11T00:37:10.690+0700 I  NETWORK  [conn1955] end connection 10.100.144.47:28042 (34 connections now open)
2022-03-11T00:37:10.690+0700 I  NETWORK  [listener] connection accepted from 10.100.144.47:43278 #1963 (35 connections now open)
2022-03-11T00:37:10.690+0700 I  NETWORK  [listener] connection accepted from 10.100.144.47:43280 #1964 (36 connections now open)
2022-03-11T00:37:10.691+0700 I  NETWORK  [conn1963] received client metadata from 10.100.144.47:43278 conn1963: { driver: { name: "NetworkInterfaceTL", version: "4.2.12" },

4. How can the community help?

Does anyone have a suggestion on what could be the culprit?

To add to this, as part of Log4j mitigation the JVM option -Dlog4j2.formatMsgNoLookups=true was also set on a Thursday at the same time. Could that correlate with this weekly issue?

Thanks in advance for your help!

Hello @groovyghost

Need to ask a couple questions.

  • What do the configurations for the Graylog/MongoDB/Elasticsearch cluster look like (i.e., the configuration files)?
  • By chance are you monitoring the network from all nodes? Looking for a disturbance in the force.
  • What other software do you have running around that time in your environment?
  • Is Elasticsearch rotating an index during this time?

Correct me if I’m wrong: you have three nodes with Graylog/Elasticsearch/MongoDB on each? If so, you would have a three-node MongoDB cluster. If this is correct, then perhaps the following will help.

What is your rs.status() on MongoDb when this issue happens?

You can increase the heartbeat settings in the replica-set configuration:

settings: {
    chainingAllowed : <boolean>,
    heartbeatIntervalMillis : <int>,
    heartbeatTimeoutSecs: <int>,
    electionTimeoutMillis : <int>,
    catchUpTimeoutMillis : <int>,
    getLastErrorModes : <document>,
    getLastErrorDefaults : <document>,
    replicaSetId: <ObjectId>
  }

This documentation here
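If you decide to try that, a minimal sketch of applying those settings from the shell might look like the following. The numbers are only illustrative, not recommendations (the defaults are heartbeatTimeoutSecs: 10 and electionTimeoutMillis: 10000), and it assumes ncc-prod-glog is still the primary:

# Sketch: raise the replica-set heartbeat/election timeouts (illustrative values only).
# Run against the current PRIMARY node.
mongo --host ncc-prod-glog:27017 <<'EOF'
var cfg = rs.conf();
cfg.settings.heartbeatTimeoutSecs = 20;       // default 10
cfg.settings.electionTimeoutMillis = 20000;   // default 10000
rs.reconfig(cfg);
printjson(rs.conf().settings);                // verify the new values
EOF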

So my understanding is that in a replica set, if the primary node goes down, one of the secondary nodes takes its place as primary; this process is called failover. The failover mechanism also means that a majority of the nodes must be available for a primary to be elected. The available nodes decide which node becomes primary through the process of election. Perhaps this is what’s going on.

In a three-member replica set (with all members eligible to vote) the majority is 2, I believe. A majority of voting members must be reachable for an election to take place and a new primary to be elected. In your case that would be one primary and two secondary members; is this correct? If so, either secondary can become the primary in an election.

You can find more here
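For a quick look at which members actually get a vote (and their priorities), something like this from any node should do; it assumes the legacy mongo shell is installed locally:

# Sketch: list each member's host, votes, and priority from the replica-set config.
mongo --quiet --eval '
  rs.conf().members.forEach(function (m) {
    print(m.host + "  votes=" + m.votes + "  priority=" + m.priority);
  })'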

Out of curiosity, is there an AD DC in this environment?
Do all these nodes have FQDNs, and are those FQDNs reachable on the network?

NOTE: I have an AD DC and can ping the FQDN of each cluster node; each of my Graylog servers’ hosts files is also configured like so in /etc/hosts:

192.168.1.12 GL-01.domain.com
192.168.1.13 GL-02.domain.com
192.168.1.14 GL-03.domain.com

This ensures that each node knows about the other nodes. I’m not sure if that will help, but it has stopped connection issues in my case.
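If you want to double-check that on your side, a rough sketch (using my example hostnames above; swap in your own) to confirm each node resolves the others and can reach the MongoDB port:

# Sketch: from each node, verify name resolution and reachability of the MongoDB port.
for h in GL-01.domain.com GL-02.domain.com GL-03.domain.com; do
  getent hosts "$h"                              # resolves via /etc/hosts or DNS
  timeout 3 bash -c "</dev/tcp/$h/27017" \
    && echo "$h:27017 reachable" || echo "$h:27017 NOT reachable"
done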

The reason I stated this is the MongoDB log you posted. I don’t see a MongoDB node ncc-prod-glog3:27017, or am I not reading these logs correctly?

I found this.

As for the examples and suggestions above, it’s not clear what is going on with this issue. What I do know is:

  1. Every Thursday at 11 PM this happens like clockwork; kind of odd, but there can be multiple reasons why.
  2. From the MongoDB logs it also looks like a network issue; not sure if it’s a configuration problem, a version problem, or a network problem.
  3. Since you cannot access the web UI, I’m also thinking about resource configuration; not sure, since no configurations were shown. Either too many resources were applied or not enough, just a guess.

To find out more, I would probably look at everything around the Graylog servers: the network, domain controllers, and anything else running during that time. Even server backups could have an impact on the network, which could in turn affect connectivity. What gets me is that it only happens at a specific time, not at a random time.
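Since it fires at the same time every week, it may also be worth dumping everything that is scheduled around then; a rough sweep on CentOS 7 (stock commands, nothing Graylog-specific) could be:

# Sketch: look for anything scheduled weekly around Thursday 23:00 on CentOS 7.
grep -r . /etc/crontab /etc/cron.d /var/spool/cron 2>/dev/null   # classic cron entries
ls /etc/cron.weekly/                                             # anacron weekly jobs
cat /etc/anacrontab                                              # when cron.weekly actually runs
systemctl list-timers --all                                      # systemd timers (logrotate, mlocate, etc.)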

EDIT: I just noticed this in the logs.

So your secondary MongoDB node stepped down as primary: because your master node went down and then came back up, ncc-prod-glogdb2 was temporarily the primary node. This is really starting to look like a configuration/network issue. Something funky is going on. It may come down to the heartbeat between the two nodes.

EDIT2: Sorry I keep adding edits, I’m curious about this statement.

I’m assuming this file was edited.

/etc/sysconfig/graylog-server

In this section?

GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g -

How about this file, since it seems to be more of an Elasticsearch setting?

/etc/elasticsearch/jvm.options

In this section…

## JVM configuration

################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################

# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space

-Xms4g
-Xmx4g
-Dlog4j2.formatMsgNoLookups=true

Last question: what do you see when executing this command?

root# df -h

I know you stated you added 500 GB to each drive; just curious.

Hi @gsmith

Thanks for the reply

Config file - server 1

  • Elasticsearch config
cluster.name: elasticsearch
node.name: ncc-prod-glog
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 172.16.20.11
discovery.zen.ping.unicast.hosts: ["172.16.20.11", "172.16.20.12", "172.16.20.13"]
discovery.zen.minimum_master_nodes: 2
action.auto_create_index: false
  • Graylog Config
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = (censored)
root_password_sha2 = (censored)
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://172.16.20.11:9000/api/
web_listen_uri = http://172.16.20.11:9000/
elasticsearch_hosts = http://ncc-prod-glogdb1:9200,http://ncc-prod-glogdb2:9200,http://ncc-prod-glog:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://ncc-prod-glogdb1:27017,ncc-prod-glogdb2:27017,ncc-prod-glog:27017/graylog?replicaSet=rs0
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32
  • Mongo config
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log
storage:
  dbPath: /var/lib/mongo
  journal:
    enabled: true
processManagement:
  fork: true  # fork and run in background
  pidFilePath: /var/run/mongodb/mongod.pid  # location of pidfile
  timeZoneInfo: /usr/share/zoneinfo
net:
  port: 27017
  bindIp: 0.0.0.0
replication:
  replSetName: rs0

Yeah, all three have node exporter (Prometheus).

Since these servers are dedicated to Graylog, no services other than Graylog, ES, Mongo, and node exporter are running.

No, index rotation happens once every 2-3 days, and on checking the logs, rotation is not happening at that time.

Yeah, you are right: all three have Graylog, Elasticsearch, and MongoDB on each.

When the issue happens the server is in a frozen state. Until utilization falls back, no commands can be executed. After that, rs.status() is normal.

There is no AD DC in this environment. FQDNs are configured in /etc/hosts on all 3 cluster nodes:

172.16.20.11   ncc-prod-glog
172.16.20.12   ncc-prod-glogdb1
172.16.20.13   ncc-prod-glogdb2

I suspected a clock/NTP issue, but there isn’t enough time difference to cause this. It feels more like a MongoDB config issue, but I’m not confident enough to reconfigure the Mongo replica set without jeopardizing the Elasticsearch data.

No, that is for the Graylog heap space. I have increased the Elasticsearch heap in /etc/elasticsearch/jvm.options:

-Xms4g
-Xmx4g

Output of df -h on server 1:

Filesystem               Size  Used Avail Use% Mounted on
devtmpfs                  32G     0   32G   0% /dev
tmpfs                     32G     0   32G   0% /dev/shm
tmpfs                     32G  3.2G   29G  11% /run
tmpfs                     32G     0   32G   0% /sys/fs/cgroup
/dev/mapper/centos-root  657G  288G  370G  44% /
/dev/sda1               1014M  225M  790M  23% /boot
tmpfs                    6.3G     0  6.3G   0% /run/user/1003

Hello,

Ok and wow.

Do all three servers freeze up, or just the master node?

Yeah, normally it’s because of a lack of disk space or resources, but you’re showing plenty of volume and resources.

Looking over my personal documents with configurations, I came across this.

vi /etc/security/limits.conf

All your other settings look good.
I found one difference, but I’m not sure if it would cause your server to freeze up.
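Rather than guessing at values, you could first check the limits the running Elasticsearch and MongoDB processes actually ended up with; a quick sketch (the pgrep patterns assume the standard package process names):

# Sketch: show the effective limits of the running Elasticsearch and MongoDB processes.
for p in $(pgrep -f org.elasticsearch.bootstrap.Elasticsearch) $(pgrep -x mongod); do
  echo "== PID $p =="
  grep -E 'open files|locked memory|processes' /proc/"$p"/limits
done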

To be honest, I’m not sure why your whole Graylog master node is freezing up.

Yeah, I would leave that alone, but when it’s running I would check your replica set to ensure it’s correct. Just to confirm: the command below shows your Mongo cluster’s replica-set status for the primary and secondary servers, so you can make sure things are correct.

rs.status()

Or perhaps

Shell# rs.isMaster() // Should Show “isMaster” =True

For metrics, I see you have Prometheus; I should have been clearer about trending metrics. I was referring to something outside your servers, but I guess if they freeze up, you’re not going to get anything.

By chance, is there a load balancer (i.e. nginx/apache) configured on these servers?
I found this for your version, if needed
https://archivedocs.graylog.org/en/2.4/pages/configuration/multinode_setup.html#configure-multinode

EDIT: just some other notes. What do you think about increasing the batch size, and, second, about raising the Graylog heap?

Batch config

> # Batch size for the Elasticsearch output. This is the maximum (!) number of messages the Elasticsearch output
> # module will get at once and write to Elasticsearch in a batch call. If the configured batch size has not been
> # reached within output_flush_interval seconds, everything that is available will be flushed at once. Remember
> # that every outputbuffer processor manages its own batch and performs its own batch write calls.
> # ("outputbuffer_processors" variable)
> output_batch_size = 500 --> to something like this -->1000

Graylog Heap:

GRAYLOG_SERVER_JAVA_OPTS="-Xms3g -Xmx3g

Just an idea

Hi @gsmith
Thank you for your time in reading and asking questions

All three servers freeze up and return to normal at the same time.

vi /etc/security/limits.conf

I have gone through this file; all lines are commented out.

Output of rs.status():

{
        "set" : "rs0",
        "date" : ISODate("2022-03-12T03:48:37.134Z"),
        "myState" : 1,
        "term" : NumberLong(20),
        "syncingTo" : "",
        "syncSourceHost" : "",
        "syncSourceId" : -1,
        "heartbeatIntervalMillis" : NumberLong(2000),
        "majorityVoteCount" : 2,
        "writeMajorityCount" : 2,
        "optimes" : {
                "lastCommittedOpTime" : {
                        "ts" : Timestamp(1647056916, 3),
                        "t" : NumberLong(20)
                },
                "lastCommittedWallTime" : ISODate("2022-03-12T03:48:36.629Z"),
                "readConcernMajorityOpTime" : {
                        "ts" : Timestamp(1647056916, 3),
                        "t" : NumberLong(20)
                },
                "readConcernMajorityWallTime" : ISODate("2022-03-12T03:48:36.629Z"),
                "appliedOpTime" : {
                        "ts" : Timestamp(1647056916, 3),
                        "t" : NumberLong(20)
                },
                "durableOpTime" : {
                        "ts" : Timestamp(1647056916, 3),
                        "t" : NumberLong(20)
                },
                "lastAppliedWallTime" : ISODate("2022-03-12T03:48:36.629Z"),
                "lastDurableWallTime" : ISODate("2022-03-12T03:48:36.629Z")
        },
        "lastStableRecoveryTimestamp" : Timestamp(1647056904, 3),
        "lastStableCheckpointTimestamp" : Timestamp(1647056904, 3),
        "electionCandidateMetrics" : {
                "lastElectionReason" : "electionTimeout",
                "lastElectionDate" : ISODate("2022-03-10T17:37:11.145Z"),
                "electionTerm" : NumberLong(20),
                "lastCommittedOpTimeAtElection" : {
                        "ts" : Timestamp(1646933519, 2),
                        "t" : NumberLong(19)
                },
                "lastSeenOpTimeAtElection" : {
                        "ts" : Timestamp(1646933793, 2),
                        "t" : NumberLong(19)
                },
                "numVotesNeeded" : 2,
                "priorityAtElection" : 1,
                "electionTimeoutMillis" : NumberLong(10000),
                "numCatchUpOps" : NumberLong(0),
                "newTermStartDate" : ISODate("2022-03-10T17:37:11.189Z"),
                "wMajorityWriteAvailabilityDate" : ISODate("2022-03-10T17:37:11.701Z")
        },
        "electionParticipantMetrics" : {
                "votedForCandidate" : true,
                "electionTerm" : NumberLong(19),
                "lastVoteDate" : ISODate("2022-03-03T17:36:32.579Z"),
                "electionCandidateMemberId" : 0,
                "voteReason" : "",
                "lastAppliedOpTimeAtElection" : {
                        "ts" : Timestamp(1646328697, 1),
                        "t" : NumberLong(18)
                },
                "maxAppliedOpTimeInSet" : {
                        "ts" : Timestamp(1646328961, 1),
                        "t" : NumberLong(18)
                },
                "priorityAtElection" : 1
        },
        "members" : [
                {
                        "_id" : 0,
                        "name" : "ncc-prod-glogdb1:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 727926,
                        "optime" : {
                                "ts" : Timestamp(1647056915, 3),
                                "t" : NumberLong(20)
                        },
                        "optimeDurable" : {
                                "ts" : Timestamp(1647056915, 3),
                                "t" : NumberLong(20)
                        },
                        "optimeDate" : ISODate("2022-03-12T03:48:35Z"),
                        "optimeDurableDate" : ISODate("2022-03-12T03:48:35Z"),
                        "lastHeartbeat" : ISODate("2022-03-12T03:48:36.009Z"),
                        "lastHeartbeatRecv" : ISODate("2022-03-12T03:48:36.732Z"),
                        "pingMs" : NumberLong(0),
                        "lastHeartbeatMessage" : "",
                        "syncingTo" : "ncc-prod-glog:27017",
                        "syncSourceHost" : "ncc-prod-glog:27017",
                        "syncSourceId" : 2,
                        "infoMessage" : "",
                        "configVersion" : 3
                },
                {
                        "_id" : 1,
                        "name" : "ncc-prod-glogdb2:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 727914,
                        "optime" : {
                                "ts" : Timestamp(1647056915, 3),
                                "t" : NumberLong(20)
                        },
                        "optimeDurable" : {
                                "ts" : Timestamp(1647056915, 3),
                                "t" : NumberLong(20)
                        },
                        "optimeDate" : ISODate("2022-03-12T03:48:35Z"),
                        "optimeDurableDate" : ISODate("2022-03-12T03:48:35Z"),
                        "lastHeartbeat" : ISODate("2022-03-12T03:48:36.009Z"),
                        "lastHeartbeatRecv" : ISODate("2022-03-12T03:48:36.490Z"),
                        "pingMs" : NumberLong(0),
                        "lastHeartbeatMessage" : "",
                        "syncingTo" : "ncc-prod-glog:27017",
                        "syncSourceHost" : "ncc-prod-glog:27017",
                        "syncSourceId" : 2,
                        "infoMessage" : "",
                        "configVersion" : 3
                },
                {
                        "_id" : 2,
                        "name" : "ncc-prod-glog:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 6797099,
                        "optime" : {
                                "ts" : Timestamp(1647056916, 3),
                                "t" : NumberLong(20)
                        },
                        "optimeDate" : ISODate("2022-03-12T03:48:36Z"),
                        "syncingTo" : "",
                        "syncSourceHost" : "",
                        "syncSourceId" : -1,
                        "infoMessage" : "",
                        "electionTime" : Timestamp(1646933831, 1),
                        "electionDate" : ISODate("2022-03-10T17:37:11Z"),
                        "configVersion" : 3,
                        "self" : true,
                        "lastHeartbeatMessage" : ""
                }
        ],
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1647056916, 3),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1647056916, 3)
}

No. We’re planning to deploy nginx as a load balancer, but due to this issue it’s been held back.

Increasing the batch_size - I will go through it and try; will let you know.
Raising the Graylog heap - I don’t think that would make any improvement. Our input volume is small (150-200 msgs/sec) compared to the resources; even without an LB the current setup can crunch through those numbers.

Also, some queries:

Can I take a mongo dump, reconfigure the replica set, and then restore the dump? Will that work?
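What I have in mind is roughly the following (host names and backup paths are just placeholders, and graylog-server would be stopped on all nodes first):

# Sketch: dump the graylog database before touching the replica set, restore afterwards.
mongodump --host ncc-prod-glog:27017 --db graylog --out /backup/graylog-$(date +%F)

# ... reconfigure / re-initiate the replica set here ...

mongorestore --host <new-primary>:27017 --db graylog --drop /backup/graylog-$(date +%F)/graylog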

EDIT:

Found this in the graylog-server log:
2022-03-11T00:37:12.205+07:00 WARN [NodePingThread] Did not find meta info of this node. Re-registering.

I’m not sure.
By chance, are these virtual machines or physical hosts? If they’re VMs, have you checked the host server?

It means that most likely your system clocks are not in sync. The Graylog node is removed from MongoDB because another node thinks it is outdated.
Are the system clocks of the nodes synchronized with NTP? Also check that every server has a unique http_publish_uri, i.e. that all servers can talk to each other.

I found this post, it may help. In the post it suggests increasing stale_master_timeout.

This may not be your version, BUT it looks similar to your issue, with a fix.
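If you want to experiment with that, it is a plain entry in /etc/graylog/server/server.conf on each Graylog node, followed by a restart of graylog-server. The value below is only an example to try; the default is 2000 ms, if I remember correctly:

# Sketch: in /etc/graylog/server/server.conf (example value, not a recommendation).
# Time in milliseconds after which a detected stale master node is rechecked.
stale_master_timeout = 10000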
