1. Describe your incident:
On every Thursday 11PM IST all graylog server cpu usage peaks and gradually comes to normal.This event occurs in the timeframe of 11:02 - 11:05.
- At the time Graylog GUI cant be accessed.
- Utilisation and services is not registered in grafana.
- When logged is server it is in hanged state.Any commands i.e top cant be executed to check to issue.
2. Describe your environment:
-
OS Information: Centos 7
-
Package Version:
- Graylog 2.4.3
- ES Version 5.6.0
- Mongo 4.2.12
-
Service logs, configurations, and environment variables:
-
200Messages/sec and daily logs of 15 -20 GB
-
GLog server 1 :
CPU - 16
RAM - 64 GB -
GLog server 2 :
CPU - 16
RAM - 34 GB -
GLog server 3 :
CPU - 16
RAM - 34 GB
-
3. What steps have you already taken to try and solve the problem?
Checked for diskspace.Increased drive space by 500GB on each server
Increased Elasticsearch JVM from 1G to 4G
On observing mongodb logs i can find the repel set is going down but no idea what causes it.
Logs of Graylog server 2 Mongodb:
2022-03-11T00:36:33.040+0700 I REPL [replexec-38556] Member ncc-prod-glog:27017 is now in state RS_DOWN - Request 65639798 timed out, deadline was 2022-03-11T00:32:06.26
3+0700, op was RemoteCommand 65639798 -- target:[ncc-prod-glog:27017] db:admin expDate:2022-03-11T00:32:06.263+0700 cmd:{ replSetHeartbeat: "rs0", configVersion: 3, hbv: 1, f
rom: "ncc-prod-glogdb1:27017", fromId: 0, term: 19 }
2022-03-11T00:36:33.040+0700 I REPL [replexec-38559] Member ncc-prod-glogdb2:27017 is now in state RS_DOWN - no response within election timeout period
2022-03-11T00:36:33.040+0700 I REPL [replexec-38559] can't see a majority of the set, relinquishing primary
2022-03-11T00:36:33.040+0700 I REPL [replexec-38559] Stepping down from primary in response to heartbeat
2022-03-11T00:36:33.040+0700 I REPL [RstlKillOpThread] Starting to kill user operations
2022-03-11T00:36:33.040+0700 I REPL [RstlKillOpThread] Stopped killing user operations
2022-03-11T00:36:33.040+0700 I REPL [RstlKillOpThread] State transition ops metrics: { lastStateTransition: "stepDown", userOpsKilled: 0, userOpsRunning: 0 }
2022-03-11T00:36:33.040+0700 I REPL [replexec-38559] transition to SECONDARY from PRIMARY
2022-03-11T00:36:33.047+0700 I CONNPOOL [Replication] Ending connection to host ncc-prod-glog:27017 due to bad connection status: CallbackCanceled: Callback was canceled; 0
connections to that host remain open
2022-03-11T00:36:33.047+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:36:33.441+0700 I NETWORK [conn1949] end connection 10.100.144.46:56254 (37 connections now open)
2022-03-11T00:36:39.159+0700 I NETWORK [conn1944] end connection 10.100.144.41:33948 (36 connections now open)
2022-03-11T00:36:39.159+0700 I NETWORK [conn1938] end connection 10.100.144.41:33938 (35 connections now open)
2022-03-11T00:36:40.061+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:36:43.736+0700 I ELECTION [replexec-38556] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majori
ty (mask 0x1)
2022-03-11T00:36:47.077+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:36:53.040+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glogdb2:27017
2022-03-11T00:36:54.093+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:36:54.412+0700 I ELECTION [replexec-38556] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majori
ty (mask 0x1)
2022-03-11T00:36:55.540+0700 I REPL_HB [replexec-38559] Heartbeat to ncc-prod-glog:27017 failed after 2 retries, response status: NetworkInterfaceExceededTimeLimit: Couldn'
t get a connection within the time limit
2022-03-11T00:37:00.053+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glogdb2:27017
2022-03-11T00:37:01.109+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:37:05.351+0700 I ELECTION [replexec-38559] Not starting an election, since we are not electable due to: Not standing for election because I cannot see a majori
ty (mask 0x1)
2022-03-11T00:37:07.069+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glogdb2:27017
2022-03-11T00:37:08.125+0700 I CONNPOOL [Replication] Connecting to ncc-prod-glog:27017
2022-03-11T00:37:10.647+0700 I NETWORK [listener] connection accepted from 10.100.144.47:43276 #1962 (36 connections now open)
2022-03-11T00:37:10.648+0700 I NETWORK [conn1962] received client metadata from 10.100.144.47:43276 conn1962: { driver: { name: "NetworkInterfaceTL", version: "4.2.12" }, o
s: { type: "Linux", name: "CentOS Linux release 7.9.2009 (Core)", architecture: "x86_64", version: "Kernel 3.10.0-1160.31.1.el7.x86_64" } }
2022-03-11T00:37:10.690+0700 I NETWORK [conn1937] end connection 10.100.144.47:21884 (35 connections now open)
2022-03-11T00:37:10.690+0700 I NETWORK [conn1955] end connection 10.100.144.47:28042 (34 connections now open)
2022-03-11T00:37:10.690+0700 I NETWORK [listener] connection accepted from 10.100.144.47:43278 #1963 (35 connections now open)
2022-03-11T00:37:10.690+0700 I NETWORK [listener] connection accepted from 10.100.144.47:43280 #1964 (36 connections now open)
2022-03-11T00:37:10.691+0700 I NETWORK [conn1963] received client metadata from 10.100.144.47:43278 conn1963: { driver: { name: "NetworkInterfaceTL", version: "4.2.12" },
4. How can the community help?
Does anyone have a suggestion on what could be the culprit?
To add to it as a step of log4j mitigation (Set the JVM option 1.3k -Dlog4j2.formatMsgNoLookups=true
) was also done on a thursday same time.Does that co-relate to this weekly issue.
Thanks in advance for your help!