My Graylog cluster has three nodes, a global input (Beats, port 5044), a front-end nginx that proxies to Graylog's web UI on port 9000, and a layer-4 (stream) nginx proxy that forwards the log traffic to Graylog's port 5044.
When the log volume is particularly heavy, the node status looks like this:
The journal contains 943,021,634 unprocessed messages in 2081 segments. 0 messages appended, 0 messages read in the last second.
Current lifecycle state: Override lb: dead
Message processing: Disabled
Load balancer indication: DEAD
I manually clicked “Pause message processing” and “Override LB status to DEAD”.
What I don’t understand is why a large number of logs are still entering this node. What exactly do “Pause message processing” and “Override LB status to DEAD” each do?
My understanding was that once “Pause message processing” or “Override LB status to DEAD” is applied, logs should no longer enter the node?
Elasticsearch/OpenSearch indexes the messages held in the journal. The journal is doing what it’s supposed to do, but if ES/OS is read-only or is failing to index those messages, the journal will fill up. If an issue has occurred, it may take time for ES/OS to work through those messages, depending on how many processors you have committed. For example, in the Graylog configuration file you have these…
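Something along these lines (the values below are only illustrative defaults, not your actual settings; size them for your own hardware):
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2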
Normally those settings should not exceed the number of CPU cores you have on that server. processbuffer_processors is your heavy hitter, so it should get the largest share. You may also want to check the ES/OS log file for errors or warnings. Perhaps cURL your ES/OS status to make sure it’s in a green state.
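For example, a quick health check against one of your ES/OS hosts could look like this (adjust the host, port and credentials to your setup):
curl -s 'http://localhost:9200/_cluster/health?pretty'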
@gsmith Thank you very much. My node has 16 CPU cores and 32 GB of memory; here is its configuration:
- GRAYLOG_ELASTICSEARCH_DISCOVERY_ENABLED=false
- GRAYLOG_ELASTICSEARCH_REQUEST_TIMEOUT=2m
- GRAYLOG_ELASTICSEARCH_INDEX_OPTIMIZATION_JOBS=50
#- GRAYLOG_PROCESSOR_WAIT_STRATEGY=blocking
#- GRAYLOG_INPUTBUFFER_PROCESSORS=2
#- GRAYLOG_PROCESSBUFFER_PROCESSORS=16
#- GRAYLOG_OUTPUTBUFFER_PROCESSORS=12
#- GRAYLOG_RING_SIZE=65536
#- GRAYLOG_MESSAGE_JOURNAL_MAX_SIZE=5gb
#- GRAYLOG_MESSAGE_JOURNAL_MAX_AGE=12h
- GRAYLOG_HTTP_ENABLE_GZIP=true
#- GRAYLOG_ELASTICSEARCH_COMPRESSION_ENABLED=true
- GRAYLOG_ELASTICSEARCH_USE_EXPECT_CONTINUE=true
- GRAYLOG_ELASTICSEARCH_DISABLE_VERSION_CHECK=false
- GRAYLOG_ALLOW_HIGHLIGHTING=false
- GRAYLOG_ELASTICSEARCH_INDEX_OPTIMIZATION_TIMEOUT=1h
#- GRAYLOG_OUTPUT_BATCH_SIZE=4000
# if msg size is 4 KB: 4096 bytes * 10000 = ~40.96 MB sent to ES per batch
- GRAYLOG_OUTPUT_BATCH_SIZE=10000
- GRAYLOG_OUTPUT_FLUSH_INTERVAL=30
- GRAYLOG_OUTPUTBUFFER_PROCESSORS=10
- GRAYLOG_PROCESSBUFFER_PROCESSORS=16
- GRAYLOG_OUTPUTBUFFER_PROCESSOR_KEEP_ALIVE_TIME=3000
- GRAYLOG_OUTPUTBUFFER_PROCESSOR_THREADS_CORE_POOL_SIZE=2
- GRAYLOG_OUTPUTBUFFER_PROCESSOR_THREADS_MAX_POOL_SIZE=10
# if msg size is 4 KB: 4096 * 524288 = ~2 GB of memory; 4096 * 1048576 = ~4 GB
- GRAYLOG_RING_SIZE=524288 # 2^18=262144,2^19=524288; 2^20=1048576
- GRAYLOG_INPUTBUFFER_RING_SIZE=262144
- GRAYLOG_INPUTBUFFER_PROCESSORS=2
- GRAYLOG_INPUTBUFFER_WAIT_STRATEGY=yielding
- GRAYLOG_PROCESSOR_WAIT_STRATEGY=blocking
- GRAYLOG_OUTPUT_FAULT_COUNT_THRESHOLD=5
- GRAYLOG_OUTPUT_FAULT_PENALTY_SECONDS=15
- GRAYLOG_MESSAGE_JOURNAL_ENABLED=true
- GRAYLOG_MESSAGE_JOURNAL_MAX_AGE=12h
- GRAYLOG_MESSAGE_JOURNAL_MAX_SIZE=200gb
#- GRAYLOG_MESSAGE_JOURNAL_FLUSH_AGE=1m
# if msg size is 4 KB: 4096 * 250000 = ~1 GB written to disk per flush
- GRAYLOG_MESSAGE_JOURNAL_FLUSH_INTERVAL=250000
#- GRAYLOG_MESSAGE_JOURNAL_SEGMENT_AGE=1h
#- GRAYLOG_MESSAGE_JOURNAL_SEGMENT_SIZE=100mb
- GRAYLOG_LB_RECOGNITION_PERIOD_SECONDS=0
- GRAYLOG_LB_THROTTLE_THRESHOLD_PERCENTAGE=90
You are right that processbuffer_processors, outputbuffer_processors and inputbuffer_processors together exceed 16; these parameters should not exceed the number of CPU cores, with processbuffer_processors getting the largest share.
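For a 16-core node, a split that stays within the core count could look something like this (an illustrative allocation, not values I have tested):
- GRAYLOG_PROCESSBUFFER_PROCESSORS=10
- GRAYLOG_OUTPUTBUFFER_PROCESSORS=4
- GRAYLOG_INPUTBUFFER_PROCESSORS=2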
What I still don’t understand is why a large number of logs keep entering the node after “Pause message processing” and “Override LB status to DEAD”.
Also attached is my layer-4 (stream) nginx configuration:
upstream graylog-input-filebeat {
    #least_conn;
    #server 10.65.17.206:5044 weight=3;
    server 10.65.17.207:5044 weight=3;
    server 10.65.17.208:5044 weight=7;
}
server {
    listen 5044;
    #proxy_protocol on;
    proxy_timeout 3s;
    #proxy_connect_timeout 2s;
    proxy_pass graylog-input-filebeat;
}
If I stop the stream or stop the input, then no log messages reach Graylog at all. My idea was to relieve the pressure on the overloaded Graylog node with “Pause message processing” or “Override LB status to DEAD”, so that log messages would go to the other Graylog nodes instead.
My nodes are all under pressure right now.
fcb7e567 / 93fd50fcf2e6 In 19,046 / Out 0 msg/s.
The journal contains 120,741,327 unprocessed messages in 2140 segments. 19,125 messages appended, 0 messages read in the last second.
Current lifecycle state: Throttled
Message processing: Enabled
Load balancer indication: THROTTLED

0fa66f36 / 2b62c60bce14 In 0 / Out 0 msg/s.
The journal contains 943,127,396 unprocessed messages in 2078 segments. 0 messages appended, 0 messages read in the last second.
Current lifecycle state: Throttled
Message processing: Enabled
Load balancer indication: THROTTLED

9d883ac9 / 7ad72add9d9a In 16,380 / Out 17,267 msg/s.
The journal contains 46,879,104 unprocessed messages in 1561 segments. 14,439 messages appended, 10,499 messages read in the last second.
Current lifecycle state: Running
Message processing: Enabled
Load balancer indication: ALIVE
So what exactly do “Pause message processing” and “Override LB status to DEAD” do? I don’t want to have to touch nginx; that feels too cumbersome and inefficient.
Hi @tbag
you have quite a few messages queued on one of your hosts! “Pause message processing” means that processing on the Graylog side stops, but not the ingestion via your inputs. This will only increase the size of your buffers.
As far as I know, nginx in the free version is not capable of reading the LB indication from Graylog, so setting the LB status will have little to no effect.
Restarting nginx to reschedule all incoming connections to other nodes is also a gamble; they might well choose the same Graylog node again.
My solution to this is a little piece of bash magic on the LB: ss -K dst 10.20.30.40 dport 5555
In this case 10.20.30.40 is the Graylog node with the big queue and dport is the destination port of a high-volume input on that node. This kills all TCP connections to that machine and port. Sources usually re-establish them quickly, so hopefully another node gets chosen.
“Pause message processing” does not mean pause collection. The input buffer and the journal continue to fill. When you stop processing incoming messages, you relieve pressure on the processing queue, on OpenSearch itself, or both.
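You can also watch the journal backlog on a node via the REST API; assuming the API is reachable under /api on the web UI port and using placeholder admin credentials, something like:
curl -s -u admin:yourpassword 'http://127.0.0.1:9000/api/system/journal?pretty=true'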
Overriding the LB status to DEAD should work, but I don’t know how to make nginx poll and react to the LB status; it’s probably not something it does by default.
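Graylog does expose the LB state over its HTTP API on the web/API port (not the input port), so a balancer capable of HTTP health checks could poll it; a manual check against one of the upstream nodes from your config would look roughly like this:
curl -s -o /dev/null -w '%{http_code}\n' 'http://10.65.17.207:9000/api/system/lbstatus'
Expect 200 while the node reports ALIVE and 503 after overriding it to DEAD; the open-source nginx stream module cannot evaluate this on its own.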
Stopping the input will do what you want if the LB status does not: the node stops accepting those messages and nginx should route new messages to the other nodes.
If I stop the input, then no log messages come into my entire Graylog cluster at all (I only have one input); that is not what I want. I just want to adjust a single node dynamically.
Of course, if multiple inputs were created, one per Graylog node, then what you said should be possible.