Uneven distribution of unprocessed messages

I’ve already read several topics related to this, and in most cases the proposed solution was to use a load balancer to send traffic to all Graylog nodes.
But our case is a bit different, and I can’t see how we can improve it.
We have lots of nodes sending their logs to a Logstash server acting as a proxy; that proxy then sends the logs to 3 Kafka nodes, which reside on our Graylog servers.
We then have a global input that reads logs from the Kafka cluster.
So everything seems to be distributed, yet we still have this situation and I can’t find a solution to it.

Something to add:
I’ve already checked the state of our Elasticsearch; it is green, and none of the nodes has a high load.
The error logs don’t show anything special either.

Hello && Welcome

I need to ask a few questions to help with your issue.

  1. What version of Graylog and Elasticsearch are you using?
  2. What resources does each of these Graylog servers have?
  3. Is it possible to see your Graylog configuration file? I’m looking to see how you configured your buffers.
  4. Do you see anything in your log files (Elasticsearch, Graylog, MongoDB) on each of these servers?
  5. What do the buffers (input, process, and output) on each of those nodes look like?

Check the status of your shards:

curl -XGET http://localhost:9200/_cat/shards

To be honest, when I see the journal getting that full it normally indicates something wrong with processing those messages, which points to Elasticsearch. Either you need to adjust your configuration or add resources. I won’t know until you show more information.

Hello and Thanks for your response.
My main question is not why we have unprocessed messages, but why they are only on one node.
However, here is more information regarding your questions:

1 - Graylog is the latest version, 4.1.2, and Elasticsearch is 6.8
2 - 3x Graylog nodes, each with 4 CPUs and 8 GB RAM; 3x ES nodes, each with 8 CPUs and 64 GB RAM
3 - Here is my Graylog config file:

bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
elasticsearch_cluster_name = graylog
elasticsearch_discovery_zen_ping_multicast_enabled = false
elasticsearch_hosts = http://xx.xx.xx.xx:9200,http://xx.xx.xx.xx:9200,http://xx.xx.xx.xx:9200
elasticsearch_index_prefix = graylog
elasticsearch_node_name = <Node_name>
elasticsearch_replicas = 1
elasticsearch_shards = 5
http_bind_address = 0.0.0.0:9000
http_enable_cors = true
http_enable_tls = true
http_external_uri = https://<LB_FQDN>:9000/
http_publish_uri = https://<Node_FQDN>:9000
http_tls_cert_file = /etc/pki/tls/certs/<cert_file_name>.crt
http_tls_key_file = /etc/pki/tls/private/<key_file_name>.pem
is_master = true
message_journal_dir = /var/lib/graylog-server/journal
mongodb_uri = mongodb://xx.xx.xx.xx:27017,xx.xx.xx.xx:27017,xx.xx.xx.xx:27017/graylog?replicaSet=graylog-staging
output_batch_size = 4000
output_flush_interval = 1
outputbuffer_processors = 5
plugin_dir = /usr/share/graylog-server/plugin
processbuffer_processors = 10
ring_size = 65536
timeout = 60s

4 - There were some indexing errors in the Graylog and ES logs that we suspect might have caused the slow indexing, and we are trying to fix them by adjusting our pipeline rules. However, the question still remains: why do all the unprocessed messages get queued on one node while the load doesn’t get distributed?

5 - At the time I took the screenshot, when we had more than 2 million unprocessed messages on one of the nodes, I could see that only the “processing” buffer on that node was 100% utilized. Input and output buffer utilization was 10-20% at most, and the other two nodes’ utilization was the same 10-20%.

Checking the shards didn’t show anything shady either.

During that time, the node with 2 million unprocessed messages had 99% CPU utilization, while the other two had ~10%.
Since I’ve described the way we set up our inputs (a Kafka cluster), I’m wondering how we can help distribute the load.

Thanks again

Hello,

Sorry for the delay I went to a wedding.

As for your journal:

If there is an influx of messages and it starts to overwhelm Graylog, the journal will fill up quickly, and it does take some time to recover. Just a side note: 12 GB took 1.5 hours to clear.

Elasticsearch should be able to distribute these logs.
Best practice is that your buffer processor counts should match the number of CPUs in your Graylog configuration.
The example below is from my lab Graylog 4.0 server. This server ingests 30 GB a day and has 10 GB of memory and 12 CPUs.

output_fault_penalty_seconds = 30
processbuffer_processors = 7
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 12gb

So, you have 8 CPUs for each of your ES nodes, but your Graylog configuration shows

outputbuffer_processors = 5
processbuffer_processors = 10

That would be 15 CPUs, and I don’t know what you have configured for an input buffer. What this means is that Graylog is creating more threads than you have CPU cores.
I’m not 100% sure, but this might be one of your issues.
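The arithmetic above can be sketched as a quick sanity check. This is a hypothetical helper, not a Graylog tool; the values are taken from the config posted earlier, and the inputbuffer value is assumed, since that setting isn’t shown in the posted file:

```python
# Buffer thread counts from the Graylog config posted above.
processbuffer_processors = 10
outputbuffer_processors = 5
inputbuffer_processors = 2  # not in the posted config; 2 is the Graylog default (assumed)

total_threads = (processbuffer_processors
                 + outputbuffer_processors
                 + inputbuffer_processors)
cores = 4  # each Graylog node has 4 CPUs per the specs above

print(f"processor threads: {total_threads}, CPU cores: {cores}")
if total_threads > cores:
    # More processor threads than cores means they compete for CPU
    # instead of running in parallel.
    print("over-subscribed: consider lowering the *_processors values "
          "so their sum stays at or below the core count")
```

With these numbers the node is heavily over-subscribed, which matches the 99% CPU observation on the hot node.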

Also, I would check to make sure the Graylog configuration files match on each node.
For the one node with the journal filling up, make sure there isn’t anything in the way, like a firewall.

Unfortunately, I had this problem once and corrected it by adjusting my configuration to match my CPUs and adding more resources. I was able to do this quickly because all my servers are virtual machines. Then I restarted the Graylog services.

I did notice there isn’t an inputbuffer_processors configuration shown.

I would make sure that the processbuffer, outputbuffer, and inputbuffer match the number of CPU cores. Judging from the number of messages you’re trying to ingest, you may need to increase the number of CPUs on each Elasticsearch node and then adjust your configuration file.

I believe this is an Elasticsearch problem and that the configuration/resources are creating this issue.
By chance, are all of your shards evenly distributed?

Example:
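One quick way to eyeball distribution is to count shards per node from the `_cat/shards` output shown earlier. A minimal sketch; the node names and sample lines below are fabricated for illustration:

```python
from collections import Counter

# Sample lines as returned by: curl -XGET http://localhost:9200/_cat/shards
# Columns: index shard prirep state docs store ip node
# (these rows are made up for illustration)
sample = """\
graylog_0 0 p STARTED 1200 1mb 10.0.0.1 es-node-1
graylog_0 1 p STARTED 1180 1mb 10.0.0.2 es-node-2
graylog_0 2 p STARTED 1150 1mb 10.0.0.3 es-node-3
graylog_0 0 r STARTED 1200 1mb 10.0.0.2 es-node-2
graylog_0 1 r STARTED 1180 1mb 10.0.0.3 es-node-3
graylog_0 2 r STARTED 1150 1mb 10.0.0.1 es-node-1
"""

# The node name is the last column; a heavily skewed count per node
# would suggest uneven shard allocation.
counts = Counter(line.split()[-1] for line in sample.splitlines() if line.strip())
for node, n in sorted(counts.items()):
    print(node, n)  # here: 2 shards on each node, i.e. balanced
```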

I did find some old posts on the same issue you have. They might not be exactly what you have, but the issues seem similar.

Maybe these will help give you some ideas on how to resolve your issue.

And here is something on Elasticsearch.


Do you have throttling enabled for that input? I have no experience with it, as we deal with inputs that don’t support it, but I checked the source code [1]. When enabled, the input should stop reading on that node as soon as the journal hits more than 100k unprocessed messages.
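The behaviour described above boils down to a per-node threshold check. A rough sketch of the idea, not Graylog’s actual code; the 100k figure is the one mentioned from the docs/source:

```python
# Journal entries a node may accumulate before a throttleable input
# stops reading on that node (per the docs/source referenced above).
UNCOMMITTED_MESSAGES_THRESHOLD = 100_000

def should_throttle(uncommitted_journal_entries: int) -> bool:
    """Sketch: a throttleable input pauses on a node once that node's
    journal holds more than the threshold of unprocessed messages."""
    return uncommitted_journal_entries > UNCOMMITTED_MESSAGES_THRESHOLD

print(should_throttle(2_000_000))  # the stuck node in this thread -> True
print(should_throttle(500))        # the idle nodes -> False
```

Note the check is per node, which is why throttling one hot node should not stop the others from reading.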

Besides that, depending on what processing you do, you might have some very inefficient edge cases that get the node stuck.

Edit: [1] Why so complicated, it’s also in the docs: Sending in log data — Graylog 4.1.0 documentation

Thanks for the response @gsmith, and sorry for the delay.
The logic behind Graylog is still a bit like magic to me.
During our incident, when 2 million messages were in the queue, the CPU load was almost 100% on that specific Graylog node; however, the ES nodes didn’t have more than 10% load.
So I doubt that changing “processbuffer” and “outputbuffer” would really change that, since according to your explanation they are related to ES.
The Graylog configs are identical on all nodes except the URL and node fields; we use Puppet to maintain the configs, and it is easy to keep them identical.
My suspicion and understanding is that, since it is the Graylog node that has the highest CPU load, it is the pipeline rules that cause this load, which is counted as processing and fills up the processing buffer.
We are looking at our indexing errors (since there are a lot of them), and it seems that optimising them is helping with processing speed.
But the unanswered question still remains: why does one node get 20,000 messages at once, while the other two are idle and each has 500 messages?

Yes, we had throttling, which actually made the situation worse. We had lots of messages in the queue, and our developers couldn’t see the new logs because the inputs were being throttled.
As far as I’ve seen in my experience, Graylog throttles the input, not the node, so we stopped receiving logs from that input.
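One mechanical explanation worth ruling out for a single hot node with a Kafka input: Kafka assigns whole partitions to consumers, so three Graylog nodes can only share the load evenly if the topic’s partition count divides evenly among them; with a single partition, one node consumes everything. A sketch of the default range assignor’s behaviour (hypothetical node names, not Graylog code):

```python
def range_assign(num_partitions: int, consumers: list[str]) -> dict[str, list[int]]:
    """Sketch of Kafka's range assignor: each consumer gets a contiguous
    block of partitions; the first (num_partitions % consumers) consumers
    get one extra."""
    n = len(consumers)
    per, extra = divmod(num_partitions, n)
    assignment, start = {}, 0
    for i, consumer in enumerate(sorted(consumers)):
        count = per + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment

nodes = ["graylog-1", "graylog-2", "graylog-3"]
print(range_assign(1, nodes))  # one node consumes everything, two sit idle
print(range_assign(3, nodes))  # one partition each: an even split
```

So checking the partition count of the topic the global input reads from (and whether it is a multiple of the node count) may explain the skew.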

Thanks for the added info.

How about your elasticsearch config?

From your first post I did see a lot of messages in the journal, but they are not being processed.
I can only think of a couple of reasons why Elasticsearch is not processing those messages on your Graylog node (016e0073).

Shard allocation might be uneven (just a thought), permission issues, different ES versions, or the Elasticsearch configuration.

Can I ask what your elasticsearch config file looks like?

EDIT: Is the date/time on all these nodes the same?

This was the reason I asked about your “processbuffer” and “outputbuffer”.

I believe the Message flow looks like this.
INPUT → Journal → INPUT_BUFFER → PROCESS_BUFFER → OUTPUT_BUFFER.

The reason I am asking about Elasticsearch is that Graylog is holding all these messages in the journal on the node in question, which is why you see so many more messages on that node than on the other ones: they are not being processed by Elasticsearch. So, if Graylog’s journal is filling up, there may be a problem with Elasticsearch accessing the logs in Graylog’s journal to be processed.

That comes as a surprise to me. While the docs don’t explicitly state it, the metric that decision is based on is, after all, node-specific.

I would agree with that. You can get a “process buffer dump” via the Nodes overview (/system/nodes) under the “More Actions” drop-down menu. That gives you the currently processed messages. Sampling this might give you an idea of which messages are the offending or stuck ones.

Excellent nerd snipe. The order is not quite correct. After writing way too much confusing text, I got me some crayons:

Fun fact: the Local Kafka Queue Reader is called the Local Kafka Queue Reader because the journal apparently uses the same format/technique.

That also means, @Pourya, since you’re running a dedicated Kafka cluster on the same nodes, you’re basically writing to disk twice. I’m wondering now if it makes sense to disable the journaling altogether.
Maybe something to think about once this issue is resolved.
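For reference, that would be a single setting in server.conf (the same file shown earlier, where it is currently implied enabled). Note the trade-off: without the journal, messages buffered inside Graylog are lost if the node crashes before they reach Elasticsearch.

```ini
# Disable the on-disk journal; only sensible here because Kafka
# already persists the messages on the same disks.
message_journal_enabled = false
```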


Thanks for the follow up answer
Well, actually someone else configured Elasticsearch, and honestly I don’t know which options are the key factors here.
So far I’ve learned that the Java heap memory is important; for us it is set to +50% of the physical memory.
“/etc/elasticsearch/esnode_nodename/log4j2.properties” has tons of fields; I could paste all of them here, but I don’t think it would be necessary.
Should I look for anything special?

That’s a very good idea, which I haven’t thought about before. Thank you.

Great illustration

Yes, this cluster is actually our staging one; we have separated Kafka and Mongo in production.
We’re thinking of moving most of this onto AWS managed services, like using Amazon MSK instead of running Kafka and Amazon OpenSearch instead of running Elasticsearch.
I believe it will be less of a hassle.

As I mentioned before, we have over 100K index failure messages in Graylog. We’ve been trying to trace them, but at some point we just can’t find the logs which are violating those rules, and Graylog can’t show them to us; the only reference is a message ID, which I have no idea how to trace.
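One approach for triaging that volume of failures: Graylog exposes indexer failures over its REST API (GET /api/system/indexer/failures, the same data the web UI’s indexer-failures page shows), and grouping them by error text quickly shows which failure dominates. A sketch over a made-up payload; the exact field names in the response are an assumption here, so verify them in your node’s API browser:

```python
from collections import Counter

# Hypothetical payload shaped like GET /api/system/indexer/failures
# (field names assumed for illustration; check the real response).
failures = {
    "total": 3,
    "failures": [
        {"index": "graylog_42", "message": "mapper_parsing_exception: failed to parse field [level]"},
        {"index": "graylog_42", "message": "mapper_parsing_exception: failed to parse field [level]"},
        {"index": "graylog_43", "message": "illegal_argument_exception: field [src_ip]"},
    ],
}

# Group by the exception type at the start of the message to see
# which error class accounts for most of the 100K failures.
counts = Counter(f["message"].split(":")[0] for f in failures["failures"])
print(counts.most_common())
```

Fixing the top one or two error classes usually clears the bulk of the backlog.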

Some examples:

Hello,

If you don’t know already, AWS OpenSearch is forked off of the Elastic Stack (ELK Stack), and “Open Distro for Elasticsearch” is with AWS. I have installed all of them, and I don’t really see a difference in the Elasticsearch installation yet. I think it was a couple of months ago that AWS OpenSearch got a forum and combined efforts with Open Distro for Elasticsearch. I believe this came about from a licensing disagreement between AWS and Elastic.


This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.