My Graylog often stops consuming data after a period of time

1. Describe your incident:
My Graylog often stops consuming data after some time. It ran normally for the past year, but this problem suddenly started recently. After restarting graylog-server it returns to normal, but after a while, maybe a few hours or a day, it stops consuming data again.

2. Describe your environment:

  • OS Information: CentOS Linux release 7.4.1708

  • Package Version: Graylog 5.1.4

  • Service logs, configurations, and environment variables:
    2023-09-01T06:52:48.935+08:00 INFO [CreateNewSingleIndexRangeJob] Created ranges for index g4-service_14267.
    2023-09-01T06:52:48.941+08:00 INFO [SystemJobManager] SystemJob <0882e152-4851-11ee-91b8-78ac443b71f8> [org.graylog2.indexer.indices.jobs.SetIndexReadOnlyAndCalculateRangeJob] finished in 2198ms.
    2023-09-01T07:02:25.805+08:00 INFO [SystemJobManager] SystemJob <1ad3c0e0-4851-11ee-91b8-78ac443b71f8> [org.graylog2.indexer.indices.jobs.OptimizeIndexJob] finished in 578334ms.
    2023-09-01T07:31:10.442+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-1, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.444+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-0, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.578+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-2, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.718+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60853dadb3dda129103843e9-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.729+08:00 INFO [ConsumerCoordinator] [Consumer clientId=gl2-5312e0e0-60853dadb3dda129103843e9-0, groupId=hkgraylog] Revoke previously assigned partitions g3-app-log-2
    2023-09-01T07:31:10.730+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60853dadb3dda129103843e9-0, groupId=hkgraylog] (Re-)joining group
    2023-09-01T07:31:10.740+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-1, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:10.741+08:00 INFO [ConsumerCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-1, groupId=hkgraylog] Revoke previously assigned partitions
    2023-09-01T07:31:10.741+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-1, groupId=hkgraylog] (Re-)joining group
    2023-09-01T07:31:11.142+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6087ae5fb3dda129103aec0c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.143+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.167+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60877967b3dda129103ab20c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.189+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6089084db3dda129103c6404-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.469+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.524+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.536+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-2, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.538+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-0, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.569+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-2, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:11.579+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-0, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:13.469+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-1, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:13.475+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-0, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:13.587+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-2, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.145+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6087ae5fb3dda129103aec0c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.145+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.170+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60877967b3dda129103ab20c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.192+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6089084db3dda129103c6404-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.471+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.527+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.539+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-2, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.540+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-0, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.571+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-2, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:14.584+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-0, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:16.474+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-1, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:16.479+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-0, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:16.592+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-611b252a807b003fc22702e6-2, groupId=graylog2] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.147+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60891b6db3dda129103c7902-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.147+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6087ae5fb3dda129103aec0c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.172+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-60877967b3dda129103ab20c-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.194+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-6089084db3dda129103c6404-0, groupId=hkgraylog] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.473+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b9459d79e0b75aa0c4865-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing
    2023-09-01T07:31:17.532+08:00 INFO [AbstractCoordinator] [Consumer clientId=gl2-5312e0e0-610b94e5d79e0b75aa0c492e-1, groupId=g4app] Attempt to heartbeat failed since group is rebalancing

3. What steps have you already taken to try and solve the problem?
1. Upgraded graylog-server to version 5.1.4.
2. Upgraded the Kafka version.

4. How can the community help?

The Graylog log does not contain any errors. It just prints a large number of INFO lines when the problem occurs: "Attempt to heartbeat failed since group is rebalancing". Please help me locate the problem.

There is also this kind of log: due to consumer poll timeout has expired. This means the time between subsequent calls to poll() was longer than the configured max.poll.interval.ms, which typically implies that the poll loop is spending too much time processing messages. You can address this either by increasing max.poll.interval.ms or by reducing the maximum size of batches returned in poll() with max.poll.records.
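For reference, those are the standard Kafka consumer properties named in that message; the values below are only illustrative, and where they can be set depends on how the input is configured:

# Illustrative values only; the Kafka defaults are max.poll.interval.ms=300000 and max.poll.records=500
max.poll.interval.ms=600000   # allow more time between poll() calls before the consumer is considered failed
max.poll.records=250          # return fewer records per poll() so each processing cycle finishes sooner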

Hey @magicgaro

From what I understand, this may be related to Kafka.
If you get a heartbeat failure because the group is rebalancing, it indicates that your instance took too long to send the next heartbeat, was considered dead, and thus a rebalance was triggered. I would not only look at Graylog but at any other logs that pertain to your stack. Ensure all your services are running correctly without issues.
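The settings that drive that behaviour are the standard Kafka consumer heartbeat properties; the values shown here are typical defaults, but the exact defaults depend on the client version:

session.timeout.ms=45000     # how long the group coordinator waits for a heartbeat before declaring the consumer dead
heartbeat.interval.ms=3000   # how often the consumer sends heartbeats; keep this well below session.timeout.ms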

Thanks for your answer, I will try it on kafka.

I upgraded the Kafka version and its memory configuration, but it still has no effect, and when the problem occurs Graylog does not process the data used to draw the graphs.


It’s like it’s stuck.

Can anyone help me? This has been torturous.

Hey @magicgaro

Looks like your process buffer is maxed out.

The Graylog default config should look something like this…

processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2

The buffers are full because ES/OS/Kafka might not be able to absorb messages as fast as Graylog tries to send them.

Your processbuffer_processors is the heavy hitter, so it should have the highest thread count. The rule of thumb is that all three values added together should be equal to or less than the number of cores on that server; with the example above, you should have at least 10 cores. FYI, each one creates its own threads…
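A quick sanity check, assuming a Linux node:

nproc   # number of cores on the Graylog node
# processbuffer_processors + outputbuffer_processors + inputbuffer_processors
# should be <= that number, e.g. 5 + 3 + 2 = 10 with the defaults above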

Next, a few things to check (example commands are sketched after this list):
Check your resources (i.e. CPU, memory).
Check the status of your Kafka service.
Check the Kafka log files.
Post anything relevant from the log files; if you do, remove personal information before posting.
Showing your configuration files would also be appreciated.
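For example (the Kafka service name and the Graylog log path below are assumptions; adjust them to your environment):

top -b -n 1 | head -n 15                          # quick CPU/memory snapshot
systemctl status kafka                            # assumes the Kafka unit is called "kafka"
journalctl -u kafka --since "1 hour ago"          # recent Kafka service logs
tail -n 200 /var/log/graylog-server/server.log    # typical Graylog log location on package installs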

Thanks, this is my config:
processbuffer_processors = 20
outputbuffer_processors = 40
inputbuffer_processors = 2

There is nothing abnormal in the Kafka log, and the ES status is also green when Graylog has problems. However, ES did change to yellow during the two hours in which Graylog had problems.
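For what it's worth, the standard Elasticsearch APIs below show which shards are unassigned and why while the cluster is yellow (localhost:9200 is assumed):

curl 'http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason' | grep -v STARTED
curl 'http://localhost:9200/_cluster/allocation/explain?pretty'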

Hey @magicgaro

That seems like a lot of process buffer processors. Is this a cluster environment? You may be overcommitting.
Does your Graylog server have 62 CPU cores?
Just an FYI, as I stated above, processbuffer_processors is your heavy hitter.

EDIT: Just to give you an idea, I can push 51 million logs a day with 12 CPUs.

By chance, do you have extractors and pipelines configured?

You could also check the process buffer.
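The easiest way is from the node details page (System → Nodes → Details → "Get process-buffer dump"); the equivalent REST call is sketched below, assuming the node-local API on port 9000 of a 5.x install (check the API browser if the path differs):

# Path and credentials are assumptions; replace with your own
curl -u admin:yourpassword http://127.0.0.1:9000/api/system/processbufferdump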

My Graylog servers have 48 cores per node, and I have 5 servers. My daily log volume in ES is about 1 billion+. The configuration has not been changed, and this problem has occurred frequently since August 20.


Today I tried adding 3 Graylog servers, so now there are 8 Graylog servers and 5 ES servers in total. We'll see if this solves the problem.

Hey @magicgaro

Were there any changes to this cluster prior to this issue?

Just a suggestion: if I had this issue, I would reduce the amount of…

outputbuffer_processors = 40 --> outputbuffer_processors = 30

Then add…

processbuffer_processors = 20 --> processbuffer_processors = 30

Having 8 Graylog servers may not help process messages/logs. I would look at resources and Elasticsearch.
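One way to check whether Elasticsearch is the bottleneck is to look at the write thread pool for queued or rejected bulk requests (standard _cat API, localhost:9200 assumed):

curl 'http://localhost:9200/_cat/thread_pool/write?v&h=node_name,name,active,queue,rejected'
# a growing "queue" or non-zero "rejected" count means ES cannot absorb the bulk writes Graylog sends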

Sometimes this can come from a bad extractor or pipeline; I'm not sure in this case because there's not much data to go on.

Thanks for the help. I didn't make any changes at the time of the issue. After adding 3 servers, the cluster is currently running stably. If the issue comes back, I will try modifying these parameters.

It’s still having issues and I don’t know what to do anymore.

Hey @magicgaro

EDIT: I misread your post.
Looking at your Graylog settings, as I stated above, they do not look right.

The processbuffer_processors needs to have a higher count than the outputbuffer_processors.
What I would suggest is…

processbuffer_processors = 40
outputbuffer_processors = 20
inputbuffer_processors = 2

Adding Graylog servers is not going to help much with processing logs; this is done through OpenSearch/Elasticsearch.

In layman's terms:

  • Graylog directs and controls.
  • MongoDB holds metadata.
  • OpenSearch processes and stores data.

You need to check the OpenSearch/Elasticsearch status and ensure the indices are not in read-only mode.

curl http://localhost:9200/_cluster/health
curl http://localhost:9200/_cluster/settings
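If the cluster ever hit the disk flood-stage watermark, the indices get a read_only_allow_delete block that silently stops indexing. A standard way to check for it, and to clear it once disk space has been freed, looks like this (localhost:9200 assumed):

curl 'http://localhost:9200/_all/_settings?filter_path=*.settings.index.blocks'
# clear the block only after freeing disk space
curl -X PUT 'http://localhost:9200/_all/_settings' -H 'Content-Type: application/json' \
  -d '{"index.blocks.read_only_allow_delete": null}'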

It might take a while to index the messages from your journal; I had to wait a few hours to index 2 million messages. This all depends on your resources.
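You can watch the journal draining on the node page (System → Nodes), or via the REST API; the path and credentials below are assumptions for a 5.x node-local API:

curl -u admin:yourpassword http://127.0.0.1:9000/api/system/journal
# reports uncommitted journal entries and append/read rates for this node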

What are the resources of your OpenSearch/Kafka cluster?

EDIT 2: Judging by how many messages you are receiving, I would add more OpenSearch nodes to that cluster. The front end seems fine, but you have an issue with processing.

Yes, I confirmed that my ES cluster is normal and the status is green. I just restart graylog-server and it goes back to normal. The strange thing is that I have two sets of Graylog servers; the other set handles more traffic than the problematic one, yet it runs normally.


You see, this is the problem: as time goes by, all nodes gradually stop consuming. It's not that all nodes stop working at once.

Hey @magicgaro

Can you show the configuration(s) of how this cluster is set up?

I assume you followed these instructions prior to setting up this cluster here

Are you using a load balancer in front? If so, can you show how that is set up?

You're using a lot of heap, and again, I assume you have seen this here

I’m also keeping in mind what you stated…

Does each node have Graylog/MongoDB and OpenSearch/Kafka, or did you separate OpenSearch/Kafka from Graylog/MongoDB? I'm just checking; I'm trying to figure out this setup so I can be more helpful.

Next, I see you're using IP addresses rather than FQDNs, so what DNS server are you using? The reason I ask is that if you are just using 8.8.8.8, then the hosts file on each node should be configured, just an idea. I realize this just started happening and you also stated nothing has changed, but something had to change for this to happen. Maybe the settings were not configured correctly and it just took time for an issue like this to appear.
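If the nodes ever have to resolve each other by name, pinning the cluster members in /etc/hosts on every node removes the DNS dependency; the hostnames and addresses below are purely illustrative:

# /etc/hosts (example entries only, not your real addresses)
10.0.0.11   graylog-node1
10.0.0.12   graylog-node2
10.0.0.21   opensearch-node1
10.0.0.31   kafka-node1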

graylog-server and OpenSearch run together, while Kafka and MongoDB are separate. The servers use private IPs in the IDC. DNS is also self-built, with upstream recursion to 8.8.8.8.

I think your setup might be fighting over resources, but I'm not 100% sure.

Normally when you see this,

I can say it's either resources and/or configuration. Something triggered this issue; I have given all my info above to try to resolve it, but maybe someone else here has a better idea.