My Graylog cluster has three nodes, a global input (Beats, port 5044), a front-end nginx that proxies to Graylog's web UI on port 9000, and a layer-4 (stream) nginx proxy that forwards the log traffic to Graylog's port 5044.
When the log volume is particularly heavy, the node status looks like this:
The journal contains 943,021,634 unprocessed messages in 2081 segments. 0 messages appended, 0 messages read in the last second.
Current lifecycle state: Override lb: dead
Message processing: Disabled
Load balancer indication: DEAD
I manually clicked “Pause message processing” and “Override LB status to DEAD”.
What I don’t understand is why a large number of logs are still entering this node. What exactly do “Pause message processing” and “Override LB status to DEAD” each do?
My understanding was that once “Pause message processing” or “Override LB status to DEAD” is applied, logs should no longer enter the node?
Elasticsearch/OpenSearch indexes the messages held in the journal. The journal is doing what it’s supposed to do, but if ES/OS is read-only or is failing to index those messages, the journal will fill up. If an issue has occurred, it may take time for ES/OS to work through those messages, depending on how many processors you have committed. For example, in the Graylog configuration file you have these…
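Something along these lines (the values below are only illustrative defaults, not your actual settings; size them for your own hardware):
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2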
Normally those settings should not exceed the number of CPU cores you have on that server. processbuffer_processors is your heavy hitter, so it should get the largest share. You may also want to check the ES/OS log file for errors or warnings. Perhaps cURL your ES/OS status to make sure it’s in a green state.
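For example, a quick health check against one of your ES/OS hosts could look like this (adjust the host, port and credentials to your setup):
curl -s 'http://localhost:9200/_cluster/health?pretty'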
@gsmith Thank you very much. My node has 16 CPU cores and 32 GB of memory; here is its configuration:
- GRAYLOG_ELASTICSEARCH_DISCOVERY_ENABLED=false
- GRAYLOG_ELASTICSEARCH_REQUEST_TIMEOUT=2m
- GRAYLOG_ELASTICSEARCH_INDEX_OPTIMIZATION_JOBS=50
#- GRAYLOG_PROCESSOR_WAIT_STRATEGY=blocking
#- GRAYLOG_INPUTBUFFER_PROCESSORS=2
#- GRAYLOG_PROCESSBUFFER_PROCESSORS=16
#- GRAYLOG_OUTPUTBUFFER_PROCESSORS=12
#- GRAYLOG_RING_SIZE=65536
#- GRAYLOG_MESSAGE_JOURNAL_MAX_SIZE=5gb
#- GRAYLOG_MESSAGE_JOURNAL_MAX_AGE=12h
- GRAYLOG_HTTP_ENABLE_GZIP=true
#- GRAYLOG_ELASTICSEARCH_COMPRESSION_ENABLED=true
- GRAYLOG_ELASTICSEARCH_USE_EXPECT_CONTINUE=true
- GRAYLOG_ELASTICSEARCH_DISABLE_VERSION_CHECK=false
- GRAYLOG_ALLOW_HIGHLIGHTING=false
- GRAYLOG_ELASTICSEARCH_INDEX_OPTIMIZATION_TIMEOUT=1h
#- GRAYLOG_OUTPUT_BATCH_SIZE=4000
# if msg size is 4 KB: 4096 bytes * 10000 = ~40.96 MB sent to ES per batch
- GRAYLOG_OUTPUT_BATCH_SIZE=10000
- GRAYLOG_OUTPUT_FLUSH_INTERVAL=30
- GRAYLOG_OUTPUTBUFFER_PROCESSORS=10
- GRAYLOG_PROCESSBUFFER_PROCESSORS=16
- GRAYLOG_OUTPUTBUFFER_PROCESSOR_KEEP_ALIVE_TIME=3000
- GRAYLOG_OUTPUTBUFFER_PROCESSOR_THREADS_CORE_POOL_SIZE=2
- GRAYLOG_OUTPUTBUFFER_PROCESSOR_THREADS_MAX_POOL_SIZE=10
# if msg size is 4 KB: 4096 * 524288 = ~2 GB of memory; 4096 * 1048576 = ~4 GB
- GRAYLOG_RING_SIZE=524288 # 2^18=262144,2^19=524288; 2^20=1048576
- GRAYLOG_INPUTBUFFER_RING_SIZE=262144
- GRAYLOG_INPUTBUFFER_PROCESSORS=2
- GRAYLOG_INPUTBUFFER_WAIT_STRATEGY=yielding
- GRAYLOG_PROCESSOR_WAIT_STRATEGY=blocking
- GRAYLOG_OUTPUT_FAULT_COUNT_THRESHOLD=5
- GRAYLOG_OUTPUT_FAULT_PENALTY_SECONDS=15
- GRAYLOG_MESSAGE_JOURNAL_ENABLED=true
- GRAYLOG_MESSAGE_JOURNAL_MAX_AGE=12h
- GRAYLOG_MESSAGE_JOURNAL_MAX_SIZE=200gb
#- GRAYLOG_MESSAGE_JOURNAL_FLUSH_AGE=1m
# if msg size is 4 KB: 4096 * 250000 = ~1 GB written to disk per flush
- GRAYLOG_MESSAGE_JOURNAL_FLUSH_INTERVAL=250000
#- GRAYLOG_MESSAGE_JOURNAL_SEGMENT_AGE=1h
#- GRAYLOG_MESSAGE_JOURNAL_SEGMENT_SIZE=100mb
- GRAYLOG_LB_RECOGNITION_PERIOD_SECONDS=0
- GRAYLOG_LB_THROTTLE_THRESHOLD_PERCENTAGE=90
You are right that processbuffer_processors, outputbuffer_processors and inputbuffer_processors together exceed 16; these parameters should not exceed the number of CPU cores, with processbuffer_processors getting the largest share.
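For a 16-core node, a split that stays within the core count could look something like this (an illustrative allocation, not values I have tested):
- GRAYLOG_PROCESSBUFFER_PROCESSORS=10
- GRAYLOG_OUTPUTBUFFER_PROCESSORS=4
- GRAYLOG_INPUTBUFFER_PROCESSORS=2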
What I still don’t understand is why a large number of logs keep entering the node after “Pause message processing” and “Override LB status to DEAD”.
Also attached is my layer-4 (stream) nginx configuration:
upstream graylog-input-filebeat {
    #least_conn;
    #server 10.65.17.206:5044 weight=3;
    server 10.65.17.207:5044 weight=3;
    server 10.65.17.208:5044 weight=7;
}
server {
    listen 5044;
    #proxy_protocol on;
    proxy_timeout 3s;
    #proxy_connect_timeout 2s;
    proxy_pass graylog-input-filebeat;
}
If I stop the stream or stop the input, then no log messages reach Graylog at all. My idea was to relieve the pressure on the overloaded Graylog node with “Pause message processing” or “Override LB status to DEAD”, so that log messages would go to the other Graylog nodes instead.
My nodes are all under pressure right now.
fcb7e567 / 93fd50fcf2e6 In 19,046 / Out 0 msg/s.
The journal contains 120,741,327 unprocessed messages in 2140 segments. 19,125 messages appended, 0 messages read in the last second.
Current lifecycle state: Throttled
Message processing: Enabled
Load balancer indication: THROTTLED

0fa66f36 / 2b62c60bce14 In 0 / Out 0 msg/s.
The journal contains 943,127,396 unprocessed messages in 2078 segments. 0 messages appended, 0 messages read in the last second.
Current lifecycle state: Throttled
Message processing: Enabled
Load balancer indication: THROTTLED

9d883ac9 / 7ad72add9d9a In 16,380 / Out 17,267 msg/s.
The journal contains 46,879,104 unprocessed messages in 1561 segments. 14,439 messages appended, 10,499 messages read in the last second.
Current lifecycle state: Running
Message processing: Enabled
Load balancer indication: ALIVE
So what exactly do “Pause message processing” and “Override LB status to DEAD” do? I don’t want to have to touch nginx; that feels too cumbersome and inefficient.
Hi @tbag
you have quite a few messages queued on one of your hosts! “Pause message processing” means that processing on the Graylog side stops, but not the ingestion via your inputs. This will only increase the size of your buffers.
As far as I know, nginx in the free version is not capable of reading the LB indication from Graylog, so setting the LB status will have little to no effect.
Restarting nginx to reschedule all incoming connections to other nodes is also a gamble; they might well choose the same Graylog node again.
My solution to this is a little piece of bash magic on the LB: ss -K dst 10.20.30.40 dport 5555
In this case 10.20.30.40 is the Graylog node with the big queue and dport is the destination port of a high-volume input on that node. This kills all TCP connections to that machine and port. Sources usually re-establish them quickly, so hopefully another node gets chosen.
“Pause message processing” does not mean pause collection. The input buffer and the journal continue to fill. When you stop processing incoming messages, you relieve pressure on the processing queue, on OpenSearch itself, or both.
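You can also watch the journal backlog on a node via the REST API; assuming the API is reachable under /api on the web UI port and using placeholder admin credentials, something like:
curl -s -u admin:yourpassword 'http://127.0.0.1:9000/api/system/journal?pretty=true'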
Overriding the LB status to DEAD should work, but I don’t know how to make nginx poll and react to the LB status; it’s probably not something it does by default.
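Graylog does expose the LB state over its HTTP API on the web/API port (not the input port), so a balancer capable of HTTP health checks could poll it; a manual check against one of the upstream nodes from your config would look roughly like this:
curl -s -o /dev/null -w '%{http_code}\n' 'http://10.65.17.207:9000/api/system/lbstatus'
Expect 200 while the node reports ALIVE and 503 after overriding it to DEAD; the open-source nginx stream module cannot evaluate this on its own.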
Stopping the input will do what you want if the LB status does not: the node stops accepting those messages and nginx should route new messages to the other nodes.
If I stop the input, then no log messages come into my entire Graylog cluster at all (I only have one input); that is not what I want. I just want to adjust a single node dynamically.
Of course, if multiple inputs were created, one per Graylog node, then what you said should be possible.