Graylog cluster: process buffer at 100%, nodes stop processing messages

Hello,
I know there are several reports of this kind of issue, and I have tried every suggestion I found in the forum, but without success.
Basically, I have 3 nodes, each with a processing capacity of about 1,600 messages/s. Intermittently, one of them stops processing messages but keeps writing them to the journal… and the only way to get it processing again is to restart the Graylog service. A few minutes after doing that, another node stops processing messages and I have to restart its Graylog service as well. The problem occurs on all nodes, one by one, at irregular intervals. I believed it happened because of malformed messages in a Fortigate/Fortinet log, but I corrected that with help from the forum by treating the messages as RAW. The Graylog and Elasticsearch logs, even in debug mode, give me no clue as to what is going on.

3 nodes, each with 16 vCPUs, 24 GB RAM, FC disk storage (3PAR) - CentOS 7, updated

CONFS

conf_graylog_node_1

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = xxxxxxxxxxxxxxx
root_username = admin
root_password_sha2 = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
root_timezone = America/Sao_Paulo
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://192.168.0.195:9000/api/
web_listen_uri = http://192.168.0.195:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = graylog
elasticsearch_discovery_zen_ping_unicast_hosts = 192.168.0.195:9300, 192.168.0.196:9300, 192.168.1.187
elasticsearch_cluster_discovery_timeout = 15000
elasticsearch_network_host = 192.168.0.195
elasticsearch_discovery_initial_state_timeout = 10s
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 16
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog1:27017,graylog2:27017,graylog3:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

conf_graylog_node_2

is_master = false
node_id_file = /etc/graylog/server/node-id
password_secret = xxxxxxxxxxxxxxx
root_username = admin
root_password_sha2 = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
root_timezone = America/Sao_Paulo
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://192.168.0.196:9000/api/
web_listen_uri = http://192.168.0.196:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = graylog
elasticsearch_discovery_zen_ping_unicast_hosts = 192.168.0.195:9300, 192.168.0.196:9300, 192.168.1.187
elasticsearch_cluster_discovery_timeout = 15000
elasticsearch_network_host = 192.168.0.196
elasticsearch_discovery_initial_state_timeout = 10s
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 16
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog1:27017,graylog2:27017,graylog3:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

conf_graylog_node_3

is_master = false
node_id_file = /etc/graylog/server/node-id
password_secret = xxxxxxxxxxxxxxx
root_username = admin
root_password_sha2 = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
root_timezone = America/Sao_Paulo
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://192.168.1.187:9000/api/
web_listen_uri = http://192.168.1.187:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = graylog
elasticsearch_discovery_zen_ping_unicast_hosts = 192.168.0.195:9300, 192.168.0.196:9300, 192.168.1.187
elasticsearch_cluster_discovery_timeout = 15000
elasticsearch_network_host = 192.168.1.187
elasticsearch_discovery_initial_state_timeout = 10s
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 16
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog1:27017,graylog2:27017,graylog3:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

elasticsearch_conf_node_1

cluster.name: graylog
node.name: graylog1.example.com.br
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["192.168.0.195", "192.168.0.196", "192.168.1.187"]
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

elasticsearch_conf_node_2

cluster.name: graylog
node.name: graylog2.example.com.br
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["192.168.0.195", "192.168.0.196", "192.168.1.187"]
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

elasticsearch_conf_node_3

cluster.name: graylog
node.name: graylog3.example.com.br
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["192.168.0.195", "192.168.0.196", "192.168.1.187"]
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

etc_sysconfig_elasticsearch ALL NODES

ES_HEAP_SIZE=12g
ES_STARTUP_SLEEP_TIME=5

etc_sysconfig_graylog-server ALL NODES

JAVA=/usr/bin/java
GRAYLOG_SERVER_JAVA_OPTS="-Xms6g -Xmx6g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"
GRAYLOG_SERVER_ARGS=""
GRAYLOG_COMMAND_WRAPPER=""

Thread dump of node 558b0d80 / graylog2.example.com.br (NOT PROCESSING right now):

https://paste.ee/p/AU1ff


If you have everything on every node - that is, 3 nodes that each run Graylog, MongoDB and Elasticsearch - then your setup and configuration is not well balanced.

Basically, your Elasticsearch does not have enough resources to ingest all the data.

If you need help with this, please check the enterprise service and support offering, as this goes beyond what we are able to do in community support.

/jd

Hi @jan.

This Graylog implementation project is a POC, and if it works correctly it has a good chance of becoming a vital tool for my client. I cannot budget for support before getting the tool to work properly, so my first attempt was the community. I will try to separate the applications and see what happens.
Thanks again.

Hej @rafaelcarsetimo

You are welcome, and maybe someone from the community will be able to help you with this. But diving that deep into someone's environment is something the Graylog core team cannot provide for free.

regards
Jan


Hi @jan

First of all, thank you very much. I did what you suggested and separated Graylog and Elasticsearch onto separate nodes; my infrastructure is now 5 Graylog nodes and 3 Elasticsearch nodes.
But I noticed the following: Elasticsearch is not the problem, it has excellent performance. It seems that some message(s) arriving in Graylog lock up the queue and fill the process buffer.
If I wait a long time the queue starts working again, but I believe the ring buffer (65,536 entries) ends up holding messages that take too long to process; simply restarting that node's Graylog "kills" those messages in the ring and the (Kafka-based) journal queue starts to shrink.
My question is whether there is a way to know which messages are in the process buffer when it gets stuck. Such a dump would help me understand what might be happening, since the cause could be a malformed message, or maybe an extractor that cannot do its job efficiently.

Thank you!

You can check extractor performance; you probably have some extractors whose regexes do catastrophic backtracking. See e.g. http://www.regular-expressions.info/catastrophic.html
This also seems to be a problem with some Grok extractors.

You can find out which regex is the problem by going to System / Inputs, then Edit Extractors. Look at "Details" in the extractor list and you will see which extractors take a long time to execute.
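
To get a feel for what catastrophic backtracking does, here is a small standalone sketch. It is plain Python rather than Graylog itself (Graylog evaluates extractors with Java's regex engine, but the backtracking behaviour is the same), and the pattern and input are invented for illustration, not taken from your extractors:

import re
import time

# Nested quantifiers, the classic catastrophic-backtracking shape. On input
# that almost matches but fails at the very end, the engine tries an
# exponential number of ways to split the run of 'a's between the groups.
bad = re.compile(r"(a+)+b")
text = "a" * 22 + "c"   # 22 'a's, then a character that makes the match fail

start = time.perf_counter()
bad.search(text)
print(f"nested quantifiers: {time.perf_counter() - start:.3f} s on a 23-char string")

# The equivalent flat pattern gives up almost instantly on the same input.
good = re.compile(r"a+b")
start = time.perf_counter()
good.search(text)
print(f"flat pattern:       {time.perf_counter() - start:.6f} s")

A regex like this looks harmless on messages that match; the cost only explodes on messages that almost match, which is exactly why a single odd log line can stall a process buffer.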


Hi @jtkarvo,

Thanks for your help. Sorry, but I do not understand those metric values very well - how do I know whether they are good or bad?

Here is an example:

Can you explain it to me, please?

How can I get these values from the Graylog API? Then I could monitor them in Zabbix; knowing which extractors are slow and which are fast, I could create an alert trigger, so I will know when an extractor becomes slow.

Regards.

Yes. This is an example of pretty bad regex performance. A simple regex in my setup typically runs in 20-50 microseconds.

The question is how to make the regex more efficient.

  • First step: get rid of the .* at the beginning and at the end. They consume time for no real purpose.
  • Second step: use an atomic group in the middle: (?>(.*)) - that reduces backtracking.
  • Third step: add a condition to the extractor so that it only runs if the field contains the string <UN_ACTIVIDADE2>.
  • Fourth step: if the field UN_ACTIVIDADE2 will never contain the character <, replace the expression with

<UN_ACTIVIDADE2>(?>([^<]*))</UN_ACTIVIDADE2>

You could possibly optimize it even more, but my current skills stop here.
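
To see how much difference those steps make, you can time the original and the tightened expressions outside of Graylog. A rough sketch in Python (Graylog runs these in Java, the sample message below is made up, and the atomic group from the second step is left out because stock Python only supports (?>...) from version 3.11 on, so only the relative difference matters):

import re
import timeit

# Worst case for an extractor: a long message that does NOT contain the tag,
# because the extractor still runs against every message on the input.
# The field names and the amount of padding are invented for illustration.
message = "devname=fw01 action=accept msg=" + "x" * 1000

slow = re.compile(r".*<UN_ACTIVIDADE2>(.*)</UN_ACTIVIDADE2>.*")  # leading/trailing .* cause heavy backtracking
fast = re.compile(r"<UN_ACTIVIDADE2>([^<]*)</UN_ACTIVIDADE2>")   # anchored on the literal tag, no wildcard scanning

for name, pattern in (("with .*  ", slow), ("tightened ", fast)):
    per_call = timeit.timeit(lambda: pattern.search(message), number=100) / 100
    print(f"{name}: {per_call * 1e6:8.1f} microseconds per non-matching message")

Adding the condition from the third step helps for the same reason: the expensive pattern simply never runs against messages that cannot contain the field.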

I don't know about an API for monitoring, but you don't really need one to get significant results; you can just walk through all your inputs and their metrics and optimize every regex that consumes more than, say, 500 microseconds of total time. The only cumbersome thing with this approach is that to see the results of an optimization in the metrics, you need to restart the Graylog cluster. Of course, a restart is not necessary to see the results in practice: the figure in the top right corner of the UI (outgoing messages per second) will increase significantly.
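
That said, the numbers in the extractor "Details" panel come from Graylog's internal metrics registry, and the REST API exposes that registry too, so an external monitor such as Zabbix can poll it. A rough sketch: the /system/metrics endpoint is part of the 2.x REST API, but the exact metric names used for extractors are an assumption here, so print a few names first and adjust the filter to whatever your installation reports:

import base64
import json
import urllib.request

# Node address and credentials are placeholders - point this at one of your
# Graylog nodes and use an account that may read the REST API.
BASE = "http://192.168.0.195:9000/api"
USER, PASSWORD = "admin", "secret"

token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
request = urllib.request.Request(
    BASE + "/system/metrics",
    headers={"Authorization": "Basic " + token, "Accept": "application/json"},
)

with urllib.request.urlopen(request) as response:
    metrics = json.load(response)

# The response groups metrics by type (gauges, counters, timers, ...).
# Rather than relying on the exact layout, walk the document and print
# anything whose metric name mentions "extractor" - adjust the substring
# once you have seen the real names on your installation.
def walk(node):
    if isinstance(node, dict):
        for key, value in node.items():
            if "extractor" in key.lower():
                print(key, json.dumps(value)[:120])
            else:
                walk(value)

walk(metrics)

Once the right metric names are known, the same request can feed a Zabbix item, and a trigger on a timer's mean or max value gives the "slow extractor" alert mentioned above.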


First, thank you. I will review all my extractors.
Do you know this tool? https://www.regexbuddy.com/debug.html

Related to this, I made a feature request on the product ideas portal for a pipeline function that would make this kind of extraction easy and efficient (an HTML tag matcher for pipelines).

… no, I have not used that RegexBuddy tool.

Hi there,

Did you manage to overcome your issues?

I am in the same situation, and I cannot work out the hardware specification for my Graylog solution, which needs to handle 150K logs per second: roughly how many servers, and with what specification, would be best?

I know the basics and have already tested it, but as this is a POC I cannot go for a support contract right now.

Hello,

Has anyone found a solution to this issue?

I have been running into it as well. The workaround I have been applying is deleting the journal, but I would like to find a permanent solution.

Best regards,
Alcides

This was my setup the last time I did a performance test:
3 Graylog servers, 3 MongoDB nodes and 3 Elasticsearch nodes… the test used a syslog input.
The servers are 6 HP BL460 Gen9 with 64 GB RAM, dual core; the disk is an EMC SAN.
Rather than extractors, I would recommend using pipelines instead, so you just route messages to the streams that contain the relevant log source(s) and avoid Grok patterns entirely.
With extractors it is hard to prevent all the different log sources on an input from being matched against every extractor attached to that input.
And, as I see it, always put a load balancer in front of this and configure the LB settings in Graylog's server.conf,
for example to 10% of the journal disk filesystem ("lb_throttle_threshold_percentage = 10"); then the LB health checks move the load to the other nodes when a node hits that 10%, so you do not end up with a full journal disk and a corrupted journal at 100%.
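
To make the health-check part concrete: Graylog has a load-balancer status endpoint that reflects exactly this throttling, so the LB can poll it instead of just checking that port 9000 is open. A minimal sketch of such a check in Python - the /api/system/lbstatus path is the documented load-balancer endpoint and is normally reachable without authentication, but treat the exact status strings (ALIVE, DEAD, and THROTTLED once lb_throttle_threshold_percentage kicks in) as something to verify against your version:

import urllib.error
import urllib.request

# Node address is a placeholder - run one check per Graylog node.
NODE = "http://192.168.0.195:9000"

def node_accepts_traffic(node: str) -> bool:
    """True only if the node answers 200/ALIVE on its LB status endpoint.

    Anything else - DEAD, THROTTLED (journal above the configured
    lb_throttle_threshold_percentage), a timeout, or a connection error -
    means the load balancer should stop sending new messages to this node.
    """
    try:
        with urllib.request.urlopen(node + "/api/system/lbstatus", timeout=2) as resp:
            return resp.status == 200 and b"ALIVE" in resp.read().upper()
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    print(NODE, "accepting traffic:", node_accepts_traffic(NODE))

Most load balancers can do the equivalent with an HTTP health monitor that expects a 200 response and the string ALIVE in the body.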

// Anders
[screenshot: benchmark_graylog]

One thing I have not seen is a picture of the node's buffers.
If only the process buffer is full, you have to check your Graylog.
If both the output and the process buffers are full, you need to check Elasticsearch.

Hello @jan, can you please help me understand what volume of logs the Community edition is able to handle?

I ask because I have a setup with one single server (MongoDB, Elasticsearch and Graylog) with only three inputs, receiving logs from 5 devices, and my server has never been stable.

Regards,

Perhaps it would have been better to put this in a new thread of your own, instead of piggy-backing on this one. Unless of course you are seeing the same buffering issues.

To my knowledge there are no stability differences between the Community and Enterprise editions. Both can handle equal amounts of data and throughput. They are limited by the underlying hardware and software in the exact same way.

one single server (MongoDB, Elasticsearch and Graylog) with only three inputs, receiving logs from 5 devices, and my server has never been stable.

To help with that situation we’ll need a lot more input :slight_smile:

  • What do you mean when you say your server “has never been stable”? What are the symptoms?
  • What OS are you running on?
  • What kind of resources does the server have? Think: CPU, RAM, storage types and space.
  • How much RAM have you assigned to the Graylog Java heap?
  • How much RAM have you assigned to the Elasticsearch Java heap?
  • What does your system’s activity viewer (like top on a Unix) tell you? How busy are your CPUs? How full is your RAM? Is there any swapping going on?

All of these things can affect your system’s performance and stability.


In our environment:
5-6k logs/sec, ~400,000,000 messages/day
13 TB of data (with 1 replica), kept for 30 days
a search over the last 30 days (one simple word) takes about 2-3 seconds
600-700 servers send logs
54 inputs, lots of extractors, ~10 pipelines.
The full Graylog infrastructure is ~20 servers, running at about 20-30% resource usage.

I am excitedly waiting to start using the system and to increase the log volume.


Bwahaha :smiley: 400e6 messages per day, with an infra spanning 20 boxen and you haven’t even started yet? Holy moley! You live an exciting life Macko!

OFF
It has to be redundant, so split it in half :slight_smile:
Exciting? No. Graylog works well, so I do not have much work to do with it.


To my knowledge there are no stability differences between the Community and Enterprise editions. Both can handle equal amounts of data and throughput. They are limited by the underlying hardware and software in the exact same way.

@a-ml so @Totally_Not_A_Robot nailed it: there is no difference between OSS and Enterprise from a core point of view.

The differences are add-ons, which do not come into play in message processing.
