Graylog cluster: process buffer at 100%, nodes stop processing messages


(Rafaelcarsetimo) #1

Hello Sirs,
I know there are several reports of this issue, and I have tried all the possibilities I found in the forum, but without success.
Basically, I have 3 nodes, each with a processing capacity of about 1,600 messages/s. But intermittently one of them stops processing messages while still writing them to the journal, and the only way to get it processing again is to restart the Graylog service. A few minutes after doing this, another node stops processing messages and again I have to restart the Graylog service. The problem occurs on all nodes, one by one, at irregular intervals. I believed it was caused by malformed messages in a Fortigate/Fortinet log, but I fixed that with help from this forum by treating the messages as raw. Neither the Graylog nor the Elasticsearch logs, even in debug mode, give me any clue as to what is going on.

3 nodes, each with 16 vCPUs, 24 GB RAM, and FC disk storage (3PAR), running an updated CentOS 7.

CONFS

conf_graylog_node_1

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = xxxxxxxxxxxxxxx
root_username = admin
root_password_sha2 = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
root_timezone = America/Sao_Paulo
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://192.168.0.195:9000/api/
web_listen_uri = http://192.168.0.195:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = graylog
elasticsearch_discovery_zen_ping_unicast_hosts = 192.168.0.195:9300, 192.168.0.196:9300, 192.168.1.187
elasticsearch_cluster_discovery_timeout = 15000
elasticsearch_network_host = 192.168.0.195
elasticsearch_discovery_initial_state_timeout = 10s
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 16
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog1:27017,graylog2:27017,graylog3:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

conf_graylog_node_2

is_master = false
node_id_file = /etc/graylog/server/node-id
password_secret = xxxxxxxxxxxxxxx
root_username = admin
root_password_sha2 = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
root_timezone = America/Sao_Paulo
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://192.168.0.196:9000/api/
web_listen_uri = http://192.168.0.196:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = graylog
elasticsearch_discovery_zen_ping_unicast_hosts = 192.168.0.195:9300, 192.168.0.196:9300, 192.168.1.187
elasticsearch_cluster_discovery_timeout = 15000
elasticsearch_network_host = 192.168.0.196
elasticsearch_discovery_initial_state_timeout = 10s
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 16
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog1:27017,graylog2:27017,graylog3:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

conf_graylog_node_3

is_master = false
node_id_file = /etc/graylog/server/node-id
password_secret = xxxxxxxxxxxxxxx
root_username = admin
root_password_sha2 = xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
root_timezone = America/Sao_Paulo
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://192.168.1.187:9000/api/
web_listen_uri = http://192.168.1.187:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = graylog
elasticsearch_discovery_zen_ping_unicast_hosts = 192.168.0.195:9300, 192.168.0.196:9300, 192.168.1.187
elasticsearch_cluster_discovery_timeout = 15000
elasticsearch_network_host = 192.168.1.187
elasticsearch_discovery_initial_state_timeout = 10s
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 16
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog1:27017,graylog2:27017,graylog3:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

elasticsearch_conf_node_1

cluster.name: graylog
node.name: graylog1.example.com.br
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["192.168.0.195", "192.168.0.196", "192.168.1.187"]
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

elasticsearch_conf_node_2

cluster.name: graylog
node.name: graylog2.example.com.br
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["192.168.0.195", "192.168.0.196", "192.168.1.187"]
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

elasticsearch_conf_node_3

cluster.name: graylog
node.name: graylog3.example.com.br
network.host: 0.0.0.0
discovery.zen.ping.unicast.hosts: ["192.168.0.195", "192.168.0.196", "192.168.1.187"]
index.refresh_interval: 30s
index.translog.flush_threshold_ops: 50000

etc_sysconfig_elasticsearch ALL NODES

ES_HEAP_SIZE=12g
ES_STARTUP_SLEEP_TIME=5

etc_sysconfig_graylog-server ALL NODES

JAVA=/usr/bin/java
GRAYLOG_SERVER_JAVA_OPTS="-Xms6g -Xmx6g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"
GRAYLOG_SERVER_ARGS=""
GRAYLOG_COMMAND_WRAPPER=""

Thread dump of node 558b0d80 / graylog2.example.com.br, which is NOT PROCESSING right now:

https://paste.ee/p/AU1ff


(Jan Doberstein) #2

If you run everything on every node, meaning all 3 nodes run Graylog, MongoDB, and Elasticsearch together, then your setup and configuration are not well balanced.

Basically, your Elasticsearch does not have enough resources to ingest all the data.

If you need help with this, please look into the enterprise service and support, as this goes well beyond what we are able to do in community support.

/jd


(Rafaelcarsetimo) #3

Hi @jan.

This Graylog implementation project is a POC, and if it works correctly it has a good chance of becoming a vital tool for my client. I cannot budget for support before getting the tool to work properly, so my first stop was the community. I will try to separate the applications and see what happens.
Thanks again.


(Jan Doberstein) #4

Hej @rafaelcarsetimo

You are welcome, and maybe someone from the community is able to help you with this. But diving that deep into someone's environment is something the Graylog core team cannot provide for free.

regards
Jan


(Rafaelcarsetimo) #5

Hi @jan

First of all, thank you very much. I did what you suggested and split Graylog and Elasticsearch onto separate nodes; my infrastructure is now 5 Graylog nodes and 3 Elasticsearch nodes.
But I noticed the following: Elasticsearch is not the problem, it performs very well. Rather, it seems that certain messages arriving in Graylog lock the queue and fill the process buffer.
If I wait long enough the queue starts working again, but I believe the ring buffer (65,536 slots) holds too many of these slow messages. Restarting Graylog on the node "kills" the messages in the ring, and the Kafka-based journal queue starts to shrink.
My question is whether there is a way to see which messages are in the process buffer when it gets stuck. Such a dump would help me understand what might be happening: it could be a malformed message, or perhaps an extractor that cannot do its job effectively.

Thank you!
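
As an aside for anyone hitting this today: Graylog 3.0 and later added exactly such a dump. System / Nodes has a "Get process-buffer dump" action that shows which message each processor thread is currently working on. Below is a minimal Java sketch of fetching it over REST; the node id reuses 558b0d80 from the thread dump in post #1, the credentials are placeholders, and the endpoint path is my assumption based on that UI action (it does not exist on 2.x):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ProcessBufferDump {
    public static void main(String[] args) throws Exception {
        // Node id from the thread dump in post #1; credentials are placeholders.
        String node = "558b0d80";
        String auth = Base64.getEncoder().encodeToString("admin:password".getBytes());

        // Assumed endpoint (Graylog 3.0+); adjust host and path to your version.
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://192.168.0.195:9000/api/cluster/" + node + "/processbufferdump"))
            .header("Authorization", "Basic " + auth)
            .header("Accept", "application/json")
            .GET()
            .build();

        HttpResponse<String> res = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString());

        // One entry per processor thread, showing the message it is stuck on.
        System.out.println(res.body());
    }
}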


(jtkarvo) #6

You can check extractor performance; you probably have some extractors whose regexes do catastrophic backtracking. See e.g. http://www.regular-expressions.info/catastrophic.html
This also seems to be a problem with some Grok extractors.

You can find out which regex is the problem by going to System/Inputs, then Edit Extractors. Look at "Details" in the extractor list and you will see which extractors take long to execute.
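
For the curious, here is a minimal, self-contained sketch (mine, not from this thread) of what catastrophic backtracking looks like in java.util.regex, the engine behind Graylog's regex extractors; the pattern is the classic example from the page linked above:

import java.util.regex.Pattern;

public class CatastrophicBacktracking {
    public static void main(String[] args) {
        // Nested quantifiers over the same character: the classic pathological case.
        Pattern bad = Pattern.compile("(x+x+)+y");

        // Inputs that *almost* match (no trailing 'y'): each extra 'x'
        // roughly doubles the time the engine needs just to fail.
        for (int n = 18; n <= 26; n += 2) {
            String input = "x".repeat(n);
            long start = System.nanoTime();
            bad.matcher(input).matches();
            System.out.printf("n=%d: %d µs to fail%n", n, (System.nanoTime() - start) / 1_000);
        }
    }
}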


(Rafaelcarsetimo) #7

Hi @jtkarvo,

Thanks for your help. Sorry, but I do not understand those metric values very well; how do I know whether a value is good or bad?

Here is an example:

[screenshot of an extractor's timer metrics, not included]

Can you explain it to me, please?

And how can I get this value from the Graylog API? Then I could monitor it in Zabbix: knowing which extractors are slow and which are fast, I could create an alert trigger so I'll know when an extractor becomes slow.

Regards.


(jtkarvo) #8

Yes, this is an example of pretty bad regex performance. A simple regex in my setup typically runs in 20-50 microseconds.

The question is: how to make the regex more efficient.

  • First step: get rid of the .* at the beginning and at the end. They consume time for no real purpose.
  • Second step: use an atomic group in the middle: (?>(.*)) - that reduces backtracking.
  • Third step: add a condition to the extractor so that it only runs if the field contains the string <UN_ACTIVIDADE2>.
  • Fourth step: if the field UN_ACTIVIDADE2 will never contain the character <, replace the expression with:

<UN_ACTIVIDADE2>(?>([^<]*))</UN_ACTIVIDADE2>

You could possibly optimize even more, but my current skills stop here.
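
A rough timing sketch of the difference the steps above can make (the sample line is made up; the two patterns are the naive and step-four versions discussed here):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractorRegexTiming {
    public static void main(String[] args) {
        // Made-up sample line: the wanted tag buried among many other tags.
        String line = "<A>1</A>".repeat(500)
                + "<UN_ACTIVIDADE2>some value</UN_ACTIVIDADE2>"
                + "<B>2</B>".repeat(500);

        // Naive pattern with the leading/trailing .* that step one removes.
        Pattern naive = Pattern.compile(".*<UN_ACTIVIDADE2>(.*)</UN_ACTIVIDADE2>.*");
        // Optimized pattern from step four: atomic group, '<' excluded inside.
        Pattern tuned = Pattern.compile("<UN_ACTIVIDADE2>(?>([^<]*))</UN_ACTIVIDADE2>");

        time("naive", naive, line);
        time("optimized", tuned, line);
    }

    static void time(String label, Pattern p, String line) {
        long start = System.nanoTime();
        for (int i = 0; i < 10_000; i++) {
            Matcher m = p.matcher(line);
            if (m.find()) m.group(1);
        }
        // Average microseconds per extraction over 10,000 runs.
        System.out.printf("%-10s %d µs/match%n", label, (System.nanoTime() - start) / 1_000 / 10_000);
    }
}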

I don't know about the API for monitoring, but you don't really need it to get significant results: just walk through all your inputs and their metrics, and optimize every regex that consumes more than, say, 500 microseconds of total time. The only cumbersome part of this approach is that to see the results of the optimization in the metrics, you need to restart the Graylog cluster. A restart is not needed to see the results in practice, of course; the figure in the top right corner of the UI (out messages per second) will increase significantly.
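
On the API question above: Graylog does expose its whole Dropwizard metrics registry over REST (GET /api/system/metrics on a node returns every registered metric as JSON), so a Zabbix item could poll that. A minimal sketch follows, with the caveat that exact extractor metric names vary by version, so filtering on the substring "extractor" is my assumption:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ExtractorMetricsPoll {
    public static void main(String[] args) throws Exception {
        // Placeholder credentials; adjust host to one of your nodes.
        String auth = Base64.getEncoder().encodeToString("admin:password".getBytes());

        // /system/metrics returns the full metrics registry as JSON.
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("http://192.168.0.195:9000/api/system/metrics"))
            .header("Authorization", "Basic " + auth)
            .header("Accept", "application/json")
            .GET()
            .build();

        String body = HttpClient.newHttpClient()
            .send(req, HttpResponse.BodyHandlers.ofString())
            .body();

        // Crude scan for extractor-related metric names; a real monitoring
        // job would parse the JSON and compare timer means to a threshold.
        for (String chunk : body.split(",")) {
            if (chunk.toLowerCase().contains("extractor")) {
                System.out.println(chunk.trim());
            }
        }
    }
}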


(Rafaelcarsetimo) #9

First, thank you. I will review all my extractors.
Do you know this tool? https://www.regexbuddy.com/debug.html


(jtkarvo) #10

Related to this, I made a feature request on the product ideas portal for a pipeline function that would make this kind of extraction easy and efficient (an HTML tag matcher for pipelines).

… no, I have not used that regexbuddy tool.


(Shamim Reza Sohag) #11

Hi there,

Did you manage to overcome your issues?

I ask because I am in the same situation, and I can't work out the hardware specification for my Graylog solution at 150K logs per second: roughly how many servers, and with what specs, would be best?

I know the basics and have already tested them, but as this is a POC I can't go for a support contract right now.