Graylog not processing data

When I look at System / Nodes / Details, I can see that the number of unprocessed messages keeps increasing and never decreases, even when I stop all inputs.

I increased the heap memory for Graylog and Elasticsearch, but that did not solve the problem.

elasticsearch 2.4.4
graylog-2.2-repository 1-5
graylog-server 2.2.2-1
mongodb-clients 1:2.4.10-5
mongodb-server 1:2.4.10-5

Physical host with an Intel® Xeon® CPU E5-1603 0 @ 2.80GHz, 8 GB RAM, 1 TB hard disk

/etc/default/graylog-server
JAVA=/usr/bin/java
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"
GRAYLOG_SERVER_ARGS=""
GRAYLOG_COMMAND_WRAPPER=""

/etc/default/elasticsearch
ES_HEAP_SIZE=4g
ES_STARTUP_SLEEP_TIME=5

/etc/graylog/server/server.conf
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = XXXX
root_password_sha2 = XXXX
root_timezone = Europe/Paris
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://X.X.X.X:9000/api/
rest_transport_uri = http://X.X.X.X:9000/api/
web_listen_uri = http://X.X.X.X:9000/
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = graylog
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 20
outputbuffer_processors = 15
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://localhost/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

/etc/elasticsearch/elasticsearch.yml
cluster.name: graylog
discovery.zen.minimum_master_nodes: 1

# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 40,
  "active_shards" : 40,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

You should set outputbuffer_processors and processbuffer_processors back to their defaults and set output_batch_size to 100.

Read the first paragraph of the posting and try running defaults before changing the settings randomly in your server.conf.
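
For reference, the processor defaults that ship in the Graylog 2.2 server.conf are, as far as I remember (please verify against the sample config of your package):

processbuffer_processors = 5
outputbuffer_processors = 3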

I did this, but nothing changed …
output_batch_size = 100
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3

PS: I am working with only one node.

In fact, I realized that some messages are processed, but most of them are deleted because the journal rotates and because processing is very, very slow.
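
For reference, the journal backlog can be checked directly (a sketch: the API call assumes admin credentials and the rest_listen_uri from server.conf, and the path comes from message_journal_dir):

# curl -u admin:password 'http://X.X.X.X:9000/api/system/journal'
# df -h /var/lib/graylog-server/journal
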
What can I do to speed up the calculation?

Which calculation do you mean?

@flapegue

With your current settings:

output_batch_size = 100
output_flush_interval = 1
outputbuffer_processors = 3

You are sending 3 * 100 messages per second to Elasticsearch. Just for reference, the text from the above-mentioned blog posting:

Next, in your Graylog server.conf, you can increase output_batch_size and adjust outputbuffer_processors to allow larger batches to be sent over fewer processors. The settings for output batch size and output buffer processors will vary depending on your environment. For our environment, we set output_batch_size to 5000 and outputbuffer_processors to 3 with a 31 GB heap memory for the Elasticsearch node.

So why not set your output_batch_size to 2000 and let the journal drain out?
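
A minimal sketch of that change in server.conf; the number in the comment is only a back-of-the-envelope ceiling (processors × batch size per flush interval), not a measured rate:

# rough ceiling per node: 3 outputbuffer processors * 2000 messages per batch,
# flushed at least every output_flush_interval (1 s) => ~6000 messages/s
output_batch_size = 2000
output_flush_interval = 1
outputbuffer_processors = 3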

I have tested several combinations,
with output_batch_size = 100 and output_batch_size = 600, to see the impact, but nothing changed.
I just tried with output_batch_size = 2000 and there is no change; the number still increases.

Last month I was on version 2.1 and I did not have this problem.
Could it come from the update?

All the settings were adjusted to suit my configuration.

Dear @flapegue,

Further investigation is very time consuming and would require digging into your log files. We can only look into that remotely if you provide log files from Elasticsearch and Graylog.

Additionally, you would need to give Graylog some time to process your journal; this can take up to a few hours.

Everything that follows now is just speculation and guessing.

You should look into your Graylog log file and read what might be the issue.
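
For example, assuming the default Debian/Ubuntu package paths (the Elasticsearch log file is named after the cluster, so graylog.log here; adjust if your logs live elsewhere):

# tail -n 200 /var/log/graylog-server/server.log | grep -iE 'error|warn'
# tail -n 200 /var/log/elasticsearch/graylog.log | grep -iE 'error|warn'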

For the graylog-server log:
https://raw.githubusercontent.com/flapegue/export/master/graylog-server.log
and for the Elasticsearch log:
https://raw.githubusercontent.com/flapegue/export/master/elasticsearch.log

thanks

Not sure what the issue is within your setup, as all the information given from your end (until now) looks normal.

For me, further investigation is not possible; that said, if you need further help, consider getting professional support.

I thank you already for the help you have given me.

I will not be able to go for paid support … but would deleting the Elasticsearch indices restore the service?
I use an LVM logical volume to store my indices, a volume that I have grown several times; could the problem come from there?
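
For reference, one way to check whether Elasticsearch is actually running out of space on that volume (a sketch, assuming the default data path of the Debian package and the graylog index prefix from server.conf):

# df -h /var/lib/elasticsearch
# curl -XGET 'http://localhost:9200/_cat/allocation?v'
# curl -XGET 'http://localhost:9200/_cat/indices/graylog_*?v'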

hi,

I used your example of output_batch_size = 5000 and outputbuffer_processors = 3, and the Graylog nodes are sometimes limited to 15k messages per second.

Is there a guide somewhere on what batch sizes are recommended relative to the Elasticsearch cluster settings? I.e., can I just double the 5000 to 10000 and be safe, or is there some other parameter that I should consider?
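
One thing worth watching while raising the batch size is whether Elasticsearch starts rejecting bulk requests; growing rejected counters usually mean the batches, or the overall indexing load, are too big for the cluster. A sketch to check the bulk thread pool:

# curl -XGET 'http://localhost:9200/_cat/thread_pool?v'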