My Graylog often stops consuming data after a period of time

1. Describe your incident:
My Graylog often stops consuming data after some time. It ran normally for the past year, but this problem suddenly started recently. After restarting graylog-server it returns to normal, but after a while, maybe a few hours or a day, it stops consuming data again.

2. Describe your environment:

  • OS Information: CentOS Linux release 7.4.1708

  • Package Version: Graylog 5.1.4

  • Service logs, configurations, and environment variables:
    2023-09-01T06:52:48.935+08:00 INFO [CreateNewSingleIndexRangeJob] Created ranges for index g4-service_14267.
    2023-09-01T06:52:48.941+08:00 INFO [SystemJobManager] SystemJob <0882e152-4851-11ee-91b8-78ac443b71f8> [org.graylog2.indexer.indices.jobs.SetIndexReadOnlyAndCalculateRangeJob] finished in 2198ms.
    2023-09-01T07:02:25.805+08:00 INFO [SystemJobManager] SystemJob <1ad3c0e0-4851-11ee-91b8-78ac443b71f8> [org.graylog2.indexer.indices.jobs.OptimizeIndexJob] finished in 578334ms.
    2023-09-01T07:31:10.442+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-1, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.444+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-0, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.578+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-2, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.718+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60853dadb3dda129103843e9-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.729+08:00 INFO [ConsumerCoordinator] [Consumer clientId=gl2-5312e0e0-60853dadb3dda129103843e9-0, groupId=hkgraylog] Revoke previously assigned partitions g3-app-log-2
    2023-09-01T07:31:10.730+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60853dadb3dda129103843e9-0, groupId=hkgraylog] (Re-)joining group
    2023-09-01T07:31:10.740+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-1, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.741+08:00 INFO [ConsumerCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-1, groupId=hkgraylog] Revoke previously assigned partitions
    2023-09-01T07:31:10.741+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-1, groupId=hkgraylog] (Re-)joining group
    2023-09-01T07:31:11.142+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6087ae5fb3dda129103aec0c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.143+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.167+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60877967b3dda129103ab20c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.189+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6089084db3dda129103c6404-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.469+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.524+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.536+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-2, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.538+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-0, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.569+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-2, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.579+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-0, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:13.469+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-1, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:13.475+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-0, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:13.587+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-2, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.145+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6087ae5fb3dda129103aec0c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.145+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.170+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60877967b3dda129103ab20c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.192+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6089084db3dda129103c6404-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.471+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.527+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.539+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-2, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.540+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-0, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.571+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-2, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.584+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-0, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:16.474+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-1, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:16.479+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-0, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:16.592+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-2, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.147+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.147+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6087ae5fb3dda129103aec0c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.172+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60877967b3dda129103ab20c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.194+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6089084db3dda129103c6404-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.473+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.532+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing

3. What steps have you already taken to try and solve the problem?
1. Upgraded graylog-server to version 5.1.4.
2. Upgraded the Kafka version.

4. How can the community help?

The Graylog log does not contain any errors. It just prints a large number of INFO lines when the problem occurs: "Attempt to heartbeat failed since group is rebalancing". Please help me locate the problem.

There is also this kind of log: due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
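For reference, those are the standard Kafka consumer properties named in that message; the values below are only illustrative, and where they can be set depends on how the input is configured:

# Illustrative values only; the Kafka defaults are max.poll.interval.ms=300000 and max.poll.records=500
max.poll.interval.ms=600000   # allow more time between poll() calls before the consumer is considered failed
max.poll.records=250          # return fewer records per poll() so each processing cycle finishes sooner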

Hey @magicgaro

From what I understand, this may be related to Kafka.
If you get a heartbeat failure because the group is rebalancing, it indicates that your instance took too long to send the next heartbeat, was considered dead, and thus a rebalance was triggered. I would not only look at Graylog but at any other logs that pertain to your stack. Ensure all your services are running correctly without issues.
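The settings that drive that behaviour are the standard Kafka consumer heartbeat properties; the values shown here are typical defaults, but the exact defaults depend on the client version:

session.timeout.ms=45000     # how long the group coordinator waits for a heartbeat before declaring the consumer dead
heartbeat.interval.ms=3000   # how often the consumer sends heartbeats; keep this well below session.timeout.ms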

Thanks for your answer, I will try it on kafka.

I upgraded the Kafka version and its memory configuration, but it still has no effect, and when the problem occurs Graylog does not process the data used to draw the graphs.


It’s like it’s stuck.

Can anyone help me? This has been torturous.

Hey @magicgaro

Looks like your process buffer is maxed out.

The Graylog default config should look something like this…

processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2

The buffers are full because ES/OS/Kafka might not be able to absorb messages as fast as Graylog tries to send them.

Your processbuffer_processors is the heavy hitter, so it should have the highest thread count. The rule of thumb is that all three values added together should be equal to or less than the number of cores on that server; with the example above, you should have at least 10 cores. FYI, each one creates its own threads…
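A quick sanity check, assuming a Linux node:

nproc   # number of cores on the Graylog node
# processbuffer_processors + outputbuffer_processors + inputbuffer_processors
# should be <= that number, e.g. 5 + 3 + 2 = 10 with the defaults above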

Next, a few things to check (example commands are sketched after this list):
Check your resources (i.e. CPU, memory).
Check the status of your Kafka service.
Check the Kafka log files.
Post anything relevant from the log files; if you do, remove personal information before posting.
Showing your configuration files would also be appreciated.
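For example (the Kafka service name and the Graylog log path below are assumptions; adjust them to your environment):

top -b -n 1 | head -n 15                          # quick CPU/memory snapshot
systemctl status kafka                            # assumes the Kafka unit is called "kafka"
journalctl -u kafka --since "1 hour ago"          # recent Kafka service logs
tail -n 200 /var/log/graylog-server/server.log    # typical Graylog log location on package installs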

Thanks, this is my config:
processbuffer_processors = 20
outputbuffer_processors = 40
inputbuffer_processors = 2

There is nothing abnormal in the Kafka log, and the ES status is also green when Graylog has problems. However, ES did change to yellow during the two hours in which Graylog had problems.
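For what it's worth, the standard Elasticsearch APIs below show which shards are unassigned and why while the cluster is yellow (localhost:9200 is assumed):

curl 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep -v STARTED
curl 'http://localhost:9200/_cluster/allocation/explain?pretty'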

Hey @magicgaro

That seems like a lot of process buffer processors. Is this a cluster environment? You may be overcommitting.
Does your Graylog server have 62 CPU cores?
Just an FYI, as I stated above, processbuffer_processors is your heavy hitter.

EDIT: Just to give you an idea, I can push 51 million logs a day with 12 CPUs.

By chance, do you have extractors and pipelines configured?

You could also check the process buffer.
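The easiest way is from the node details page (System → Nodes → Details → "Get process-buffer dump"); the equivalent REST call is sketched below, assuming the node-local API on port 9000 of a 5.x install (check the API browser if the path differs):

# Path and credentials are assumptions; replace with your own
curl -u admin:yourpassword http://127.0.0.1:9000/api/system/processbufferdump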

My Graylog servers have 48 cores per node, and I have 5 servers. My daily log volume in ES is about 1 billion+. The configuration has not been changed, and this problem has occurred frequently since August 20.


Today I tried adding 3 Graylog servers, so now there are 8 Graylog servers and 5 ES servers in total. We'll see if this solves the problem.

Hey @magicgaro

Were there any changes to this cluster prior to this issue?

Just a suggestion: if I had this issue, I would reduce the amount of…

outputbuffer_processors = 40 --> outputbuffer_processors = 30

Then add…

processbuffer_processors = 20 --> processbuffer_processors = 30

Having 8 Graylog servers may not help process messages/logs. I would look at resources and Elasticsearch.
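One way to check whether Elasticsearch is the bottleneck is to look at the write thread pool for queued or rejected bulk requests (standard _cat API, localhost:9200 assumed):

curl 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected'
# a growing "queue" or non-zero "rejected" count means ES cannot absorb the bulk writes Graylog sends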

Sometimes this can come from a bad extractor or pipeline; I'm not sure in this case because there's not much data to go on.

Thanks for the help. I didn't make any changes at the time of the issue. After adding 3 servers, the cluster is currently running stably. If the issue comes back, I will try modifying these parameters.

It’s still having issues and I don’t know what to do anymore.

Hey @magicgaro

EDIT: I misread your post.
Looking at your Graylog settings, as I stated above, they do not look right.

The processbuffer_processors needs to have a higher count than the outputbuffer_processors.
What I would suggest is…

processbuffer_processors = 40
outputbuffer_processors = 20
inputbuffer_processors = 2

Adding Graylog servers is not going to help much with processing logs; this is done through OpenSearch/Elasticsearch.

In layman's terms:

  • Graylog directs and controls.
  • MongoDB holds metadata.
  • OpenSearch processes and stores data.

You need to check the OpenSearch/Elasticsearch status and ensure the indices are not in read-only mode.

curl http://localhost:9200/_cluster/health
curl http://localhost:9200/_cluster/settings
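If the cluster ever hit the disk flood-stage watermark, the indices get a read_only_allow_delete block that silently stops indexing. A standard way to check for it, and to clear it once disk space has been freed, looks like this (localhost:9200 assumed):

curl 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks'
# clear the block only after freeing disk space
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'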

It might take a while to index the messages from your journal; I had to wait a few hours to index 2 million messages. This all depends on your resources.
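You can watch the journal draining on the node page (System → Nodes), or via the REST API; the path and credentials below are assumptions for a 5.x node-local API:

curl -u admin:yourpassword http://127.0.0.1:9000/api/system/journal
# reports uncommitted journal entries and append/read rates for this node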

What are the resources of your OpenSearch/Kafka cluster?

EDIT 2: Judging by how many messages you are receiving, I would add more OpenSearch nodes to that cluster. The front end seems fine, but you have an issue with processing.

Yes, I confirmed that my ES cluster is normal and the status is green. I just restart graylog-server and it goes back to normal. The strange thing is that I have two sets of Graylog servers; the other set handles more traffic than the problematic one, yet it runs normally.


You see, this is the problem: as time goes by, all nodes gradually stop consuming. It's not that all nodes stop working at once.

Hey @magicgaro

Can you show the configuration(s) of how this cluster is set up?

I assume you followed these instructions prior to setting up this cluster here

Are you using a load balancer in front? If so, can you show how that is set up?

You're using a lot of heap, and again, I assume you have seen this here

I’m also keeping in mind what you stated…

Does each node have Graylog/MongoDB and OpenSearch/Kafka, or did you separate OpenSearch/Kafka from Graylog/MongoDB? I'm just checking; I'm trying to figure out this setup so I can be more helpful.

Next, I see you're using IP addresses rather than FQDNs, so what DNS server are you using? The reason I ask is that if you are just using 8.8.8.8, then the hosts file on each node should be configured, just an idea. I realize this just started happening and you also stated nothing has changed, but something had to change for this to happen. Maybe the settings were not configured correctly and it just took time for an issue like this to appear.
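If the nodes ever have to resolve each other by name, pinning the cluster members in /etc/hosts on every node removes the DNS dependency; the hostnames and addresses below are purely illustrative:

# /etc/hosts (example entries only, not your real addresses)
10.0.0.11   graylog-node1
10.0.0.12   graylog-node2
10.0.0.21   opensearch-node1
10.0.0.31   kafka-node1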

graylog-server and OpenSearch run together, while Kafka and MongoDB are separate. The servers use private IPs in the IDC. DNS is also self-built, with upstream recursion to 8.8.8.8.

I think your setup might be fighting over resources, but I'm not 100% sure.

Normally when you see this,

I can say it's either resources and/or configuration. Something triggered this issue; I have given all my info above to try to resolve it, but maybe someone else here has a better idea.