Uneven distribution of unprocessed messages in graylog nodes

teddy · April 29, 2022, 10:10am

1. Describe your incident:
The existing topic “Uneven distribution of unprocessed messages” did not resolve our problem, so we are asking the Graylog community for help.
Unprocessed messages are unevenly distributed across our Graylog nodes.

2. Describe your environment:
9 Graylog nodes:
Red Hat Enterprise Linux Server release 7.4 (Maipo)
RAM: 15G
CPU: 8

7 Elasticsearch nodes:
Red Hat Enterprise Linux Server release 7.4 (Maipo)
RAM: 56G
CPU: 6

Service logs, configurations, and environment variables:

Graylog configurations:
server.conf

# WARNING: Maintained by Puppet, manual changes will be lost!

allow_highlighting = true
allow_leading_wildcard_searches = true
bin_dir = /usr/share/graylog-server/bin
data_dir = /graylog/data
elasticsearch_hosts = http://GRAYLOG-elasticsearch.service.swmconsul:9200
elasticsearch_index_optimization_jobs = 100
elasticsearch_max_total_connections = 100
http_bind_address = 0.0.0.0:9000
inputbuffer_processors = 2
is_master = true
message_journal_dir = /graylog/journal
message_journal_flush_interval = 1000000
message_journal_max_age = 72h
message_journal_max_size = 80g
message_journal_segment_age = 1h
mongodb_uri = mongodb://GRAYLOG-mongodb.service.swmconsul:27017/admin?replicaSet=graylog
node_id_file = /graylog/node-id
output_barch_size = 1000
outputbuffer_processors = 2
plugin_dir = /usr/share/graylog-server/plugin
processbuffer_processors = 4
ring_size = 131072

Elasticsearch configurations:

elasticsearch.yml

### MANAGED BY PUPPET ###
---
bootstrap.memory_lock: 'true'
cluster.name: GRAYLOG-cluster
cluster.remote.connect: 'false'
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.unicast.hosts:
- graylog-esmaster1.swmcloud.net
- graylog-esmaster2.swmcloud.net
- graylog-esmaster3.swmcloud.net
- graylog-esmaster4.swmcloud.net
discovery.zen.ping.unicast.hosts.resolve_timeout: 5s
indices.breaker.accounting.limit: 50%
indices.breaker.fielddata.limit: 50%
indices.breaker.request.limit: 25%
indices.fielddata.cache.size: 60%
indices.recovery.max_bytes_per_sec: 150mb
network.breaker.inflight_requests.limit: 25%
network.host: 0.0.0.0
node.data: true
node.ingest: 'false'
node.master: false
node.name: graylog-esnode1
path.data: "/data/GRAYLOG-cluster/graylog-esnode1"
path.logs: "/var/log/elasticsearch/graylog-esnode1"

jvm.options

# This file is managed by Puppet -- graylog-esnode1
#
# Set the 'jvm_options' parameter on the elasticsearch class to change this file.

-Dfile.encoding=UTF-8
-Dio.netty.noKeySetOptimization=true
-Dio.netty.noUnsafe=true
-Dio.netty.recycler.maxCapacityPerThread=0
-Djava.awt.headless=true
-Djna.nosys=true
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-XX:+AlwaysPreTouch
-XX:+ExitOnOutOfMemoryError
-XX:+UseCMSInitiatingOccupancyOnly
-XX:-HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
-XX:CMSInitiatingOccupancyFraction=75
-Xms28g
-Xmx28g
-Xss1m
-server
11-:-XX:+UseG1GC
11-:-XX:MaxGCPauseMillis=300
8:-XX:+PrintGCApplicationStoppedTime
8:-XX:+PrintGCDateStamps
8:-XX:+PrintGCDetails
8:-XX:+PrintTenuringDistribution
8:-XX:+UseConcMarkSweepGC
8:-XX:+UseGCLogFileRotation
8:-XX:GCLogFileSize=64m
8:-XX:NumberOfGCLogFiles=10
8:-Xloggc:/var/log/elasticsearch/gc.log

Cluster settings:

{
  "persistent" : {
    "indices" : {
      "breaker" : {
        "fielddata" : {
          "limit" : "75%"
        },
        "request" : {
          "limit" : "25%"
        },
        "accounting" : {
          "limit" : "50%"
        }
      }
    },
    "network" : {
      "breaker" : {
        "inflight_requests" : {
          "limit" : "25%"
        }
      }
    }
  },
  "transient" : { },
  "defaults" : {
    "cluster" : {
      "routing" : {
        "use_adaptive_replica_selection" : "false",
        "rebalance" : {
          "enable" : "all"
        },
        "allocation" : {
          "node_concurrent_incoming_recoveries" : "2",
          "node_initial_primaries_recoveries" : "4",
          "same_shard" : {
            "host" : "false"
          },
          "total_shards_per_node" : "-1",
          "type" : "balanced",
          "disk" : {
            "threshold_enabled" : "true",
            "watermark" : {
              "low" : "85%",
              "flood_stage" : "95%",
              "high" : "90%"
            },
            "include_relocations" : "true",
            "reroute_interval" : "60s"
          },
          "awareness" : {
            "attributes" : [ ]
          },
          "balance" : {
            "index" : "0.55",
            "threshold" : "1.0",
            "shard" : "0.45"
          },
          "enable" : "all",
          "node_concurrent_outgoing_recoveries" : "2",
          "allow_rebalance" : "indices_all_active",
          "cluster_concurrent_rebalance" : "2",
          "node_concurrent_recoveries" : "2"
        }
      },
      "indices" : {
        "tombstones" : {
          "size" : "500"
        },
        "close" : {
          "enable" : "true"
        }
      },
      "nodes" : {
        "reconnect_interval" : "10s"
      },
      "persistent_tasks" : {
        "allocation" : {
          "enable" : "all",
          "recheck_interval" : "30s"
        }
      },
      "blocks" : {
        "read_only_allow_delete" : "false",
        "read_only" : "false"
      },
      "service" : {
        "slow_task_logging_threshold" : "30s"
      },
      "name" : "GRAYLOG-cluster",
      "max_shards_per_node" : "1000",
      "remote" : {
        "node" : {
          "attr" : ""
        },
        "initial_connect_timeout" : "30s",
        "connect" : "false",
        "connections_per_cluster" : "3"
      },
      "info" : {
        "update" : {
          "interval" : "30s",
          "timeout" : "15s"
        }
      }
    },
    "no" : {
      "model" : {
        "state" : {
          "persist" : "false"
        }
      }
    },
    "logger" : {
      "level" : "INFO"
    },
    "bootstrap" : {
      "memory_lock" : "true",
      "system_call_filter" : "true",
      "ctrlhandler" : "true"
    },
    "processors" : "6",
    "ingest" : {
      "geoip" : {
        "cache_size" : "1000"
      },
      "grok" : {
        "watchdog" : {
          "max_execution_time" : "1s",
          "interval" : "1s"
        }
      }
    },
    "network" : {
      "host" : [
        "0.0.0.0"
      ],
      "tcp" : {
        "reuse_address" : "true",
        "keep_alive" : "true",
        "connect_timeout" : "30s",
        "receive_buffer_size" : "-1b",
        "no_delay" : "true",
        "send_buffer_size" : "-1b"
      },
      "bind_host" : [
        "0.0.0.0"
      ],
      "server" : "true",
      "breaker" : {
        "inflight_requests" : {
          "overhead" : "1.0"
        }
      },
      "publish_host" : [
        "0.0.0.0"
      ]
    },
    "pidfile" : "/var/run/elasticsearch/elasticsearch-graylog-esnode1.pid",
    "path" : {
      "data" : [
        "/data/GRAYLOG-cluster/graylog-esnode1"
      ],
      "logs" : "/var/log/elasticsearch/graylog-esnode1",
      "shared_data" : "",
      "home" : "/usr/share/elasticsearch",
      "repo" : [ ]
    },
    "search" : {
      "default_search_timeout" : "-1",
      "highlight" : {
        "term_vector_multi_value" : "true"
      },
      "default_allow_partial_results" : "true",
      "max_open_scroll_context" : "2147483647",
      "max_buckets" : "-1",
      "low_level_cancellation" : "false",
      "keep_alive_interval" : "1m",
      "remote" : {
        "node" : {
          "attr" : ""
        },
        "initial_connect_timeout" : "30s",
        "connect" : "true",
        "connections_per_cluster" : "3"
      },
      "default_keep_alive" : "5m",
      "max_keep_alive" : "24h"
    },
    "security" : {
      "manager" : {
        "filter_bad_defaults" : "true"
      }
    },
    "ccr" : {
      "wait_for_metadata_timeout" : "60s",
      "indices" : {
        "recovery" : {
          "recovery_activity_timeout" : "60s",
          "chunk_size" : "1mb",
          "internal_action_timeout" : "60s",
          "max_bytes_per_sec" : "40mb",
          "max_concurrent_file_chunks" : "5"
        }
      },
      "auto_follow" : {
        "wait_for_metadata_timeout" : "60s"
      }
    },
    "repositories" : {
      "fs" : {
        "compress" : "false",
        "chunk_size" : "9223372036854775807b",
        "location" : ""
      },
      "url" : {
        "supported_protocols" : [
          "http",
          "https",
          "ftp",
          "file",
          "jar"
        ],
        "allowed_urls" : [ ],
        "url" : "http:"
      }
    },
    "action" : {
      "auto_create_index" : "true",
      "search" : {
        "shard_count" : {
          "limit" : "9223372036854775807"
        }
      },
      "destructive_requires_name" : "false",
      "master" : {
        "force_local" : "false"
      }
    },
    "client" : {
      "type" : "node",
      "transport" : {
        "ignore_cluster_name" : "false",
        "nodes_sampler_interval" : "5s",
        "sniff" : "false",
        "ping_timeout" : "5s"
      }
    },
    "xpack" : {
      "watcher" : {
        "execution" : {
          "scroll" : {
            "size" : "0",
            "timeout" : ""
          },
          "default_throttle_period" : "5s"
        },
        "internal" : {
          "ops" : {
            "bulk" : {
              "default_timeout" : ""
            },
            "index" : {
              "default_timeout" : ""
            },
            "search" : {
              "default_timeout" : ""
            }
          }
        },
        "thread_pool" : {
          "queue_size" : "1000",
          "size" : "30"
        },
        "index" : {
          "rest" : {
            "direct_access" : ""
          }
        },
        "history" : {
          "cleaner_service" : {
            "enabled" : "true"
          }
        },
        "trigger" : {
          "schedule" : {
            "ticker" : {
              "tick_interval" : "500ms"
            }
          }
        },
        "enabled" : "true",
        "input" : {
          "search" : {
            "default_timeout" : ""
          }
        },
        "encrypt_sensitive_data" : "false",
        "transform" : {
          "search" : {
            "default_timeout" : ""
          }
        },
        "stop" : {
          "timeout" : "30s"
        },
        "watch" : {
          "scroll" : {
            "size" : "0"
          }
        },
        "require_manual_start" : "false",
        "bulk" : {
          "concurrent_requests" : "0",
          "flush_interval" : "1s",
          "size" : "1mb",
          "actions" : "1"
        },
        "actions" : {
          "bulk" : {
            "default_timeout" : ""
          },
          "index" : {
            "default_timeout" : ""
          }
        }
      },
      "ilm" : {
        "enabled" : "true"
      },
      "monitoring" : {
        "collection" : {
          "cluster" : {
            "stats" : {
              "timeout" : "10s"
            }
          },
          "node" : {
            "stats" : {
              "timeout" : "10s"
            }
          },
          "indices" : [ ],
          "ccr" : {
            "stats" : {
              "timeout" : "10s"
            }
          },
          "index" : {
            "stats" : {
              "timeout" : "10s"
            },
            "recovery" : {
              "active_only" : "false",
              "timeout" : "10s"
            }
          },
          "interval" : "10s",
          "enabled" : "false",
          "ml" : {
            "job" : {
              "stats" : {
                "timeout" : "10s"
              }
            }
          }
        },
        "history" : {
          "duration" : "168h"
        },
        "elasticsearch" : {
          "collection" : {
            "enabled" : "true"
          }
        },
        "enabled" : "true"
      },
      "graph" : {
        "enabled" : "true"
      },
      "rollup" : {
        "enabled" : "true",
        "task_thread_pool" : {
          "queue_size" : "4",
          "size" : "4"
        }
      },
      "sql" : {
        "enabled" : "true"
      },
      "license" : {
        "self_generated" : {
          "type" : "basic"
        }
      },
      "logstash" : {
        "enabled" : "true"
      },
      "notification" : {
        "hipchat" : {
          "host" : "",
          "port" : "443",
          "default_account" : ""
        },
        "pagerduty" : {
          "default_account" : ""
        },
        "email" : {
          "default_account" : "",
          "html" : {
            "sanitization" : {
              "allow" : [
                "body",
                "head",
                "_tables",
                "_links",
                "_blocks",
                "_formatting",
                "img:embedded"
              ],
              "disallow" : [ ],
              "enabled" : "true"
            }
          }
        },
        "reporting" : {
          "retries" : "40",
          "interval" : "15s"
        },
        "jira" : {
          "default_account" : ""
        },
        "slack" : {
          "default_account" : ""
        }
      },
      "security" : {
        "dls_fls" : {
          "enabled" : "true"
        },
        "transport" : {
          "filter" : {
            "allow" : [ ],
            "deny" : [ ],
            "enabled" : "true"
          },
          "ssl" : {
            "enabled" : "false"
          }
        },
        "enabled" : "true",
        "filter" : {
          "always_allow_bound_address" : "true"
        },
        "encryption" : {
          "algorithm" : "AES/CTR/NoPadding"
        },
        "audit" : {
          "outputs" : [
            "logfile"
          ],
          "index" : {
            "bulk_size" : "1000",
            "rollover" : "DAILY",
            "flush_interval" : "1s",
            "events" : {
              "emit_request_body" : "false",
              "include" : [
                "ACCESS_DENIED",
                "ACCESS_GRANTED",
                "ANONYMOUS_ACCESS_DENIED",
                "AUTHENTICATION_FAILED",
                "REALM_AUTHENTICATION_FAILED",
                "CONNECTION_DENIED",
                "CONNECTION_GRANTED",
                "TAMPERED_REQUEST",
                "RUN_AS_DENIED",
                "RUN_AS_GRANTED",
                "AUTHENTICATION_SUCCESS"
              ],
              "exclude" : [ ]
            },
            "queue_max_size" : "10000"
          },
          "enabled" : "false",
          "logfile" : {
            "emit_node_id" : "true",
            "emit_node_host_name" : "false",
            "emit_node_name" : "true",
            "events" : {
              "emit_request_body" : "false",
              "include" : [
                "ACCESS_DENIED",
                "ACCESS_GRANTED",
                "ANONYMOUS_ACCESS_DENIED",
                "AUTHENTICATION_FAILED",
                "CONNECTION_DENIED",
                "TAMPERED_REQUEST",
                "RUN_AS_DENIED",
                "RUN_AS_GRANTED"
              ],
              "exclude" : [ ]
            },
            "prefix" : {
              "emit_node_host_name" : "false",
              "emit_node_name" : "true",
              "emit_node_host_address" : "false"
            },
            "emit_node_host_address" : "false"
          }
        },
        "authc" : {
          "password_hashing" : {
            "algorithm" : "bcrypt"
          },
          "success_cache" : {
            "size" : "10000",
            "enabled" : "false",
            "expire_after_access" : "1h"
          },
          "api_key" : {
            "cache" : {
              "hash_algo" : "ssha256",
              "max_keys" : "10000",
              "ttl" : "24h"
            },
            "delete" : {
              "interval" : "24h",
              "timeout" : "-1"
            },
            "enabled" : "false",
            "hashing" : {
              "algorithm" : "pbkdf2"
            }
          },
          "anonymous" : {
            "authz_exception" : "true",
            "roles" : [ ],
            "username" : "_anonymous"
          },
          "run_as" : {
            "enabled" : "true"
          },
          "reserved_realm" : {
            "enabled" : "true"
          },
          "token" : {
            "compat" : {
              "enabled" : "false"
            },
            "delete" : {
              "interval" : "30m",
              "timeout" : "-1"
            },
            "enabled" : "false",
            "thread_pool" : {
              "queue_size" : "1000",
              "size" : "1"
            },
            "timeout" : "20m"
          }
        },
        "fips_mode" : {
          "enabled" : "false"
        },
        "encryption_key" : {
          "length" : "128",
          "algorithm" : "AES"
        },
        "http" : {
          "filter" : {
            "allow" : [ ],
            "deny" : [ ],
            "enabled" : "true"
          },
          "ssl" : {
            "enabled" : "false"
          }
        },
        "automata" : {
          "max_determinized_states" : "100000",
          "cache" : {
            "size" : "10000",
            "ttl" : "48h",
            "enabled" : "true"
          }
        },
        "user" : null,
        "authz" : {
          "store" : {
            "roles" : {
              "index" : {
                "cache" : {
                  "ttl" : "20m",
                  "max_size" : "10000"
                }
              },
              "cache" : {
                "max_size" : "10000"
              },
              "negative_lookup_cache" : {
                "max_size" : "10000"
              },
              "field_permissions" : {
                "cache" : {
                  "max_size_in_bytes" : "104857600"
                }
              }
            }
          }
        }
      },
      "ccr" : {
        "enabled" : "true",
        "ccr_thread_pool" : {
          "queue_size" : "100",
          "size" : "32"
        }
      },
      "http" : {
        "default_connection_timeout" : "10s",
        "proxy" : {
          "host" : "",
          "scheme" : "",
          "port" : "0"
        },
        "default_read_timeout" : "10s",
        "max_response_size" : "10mb"
      },
      "ml" : {
        "utility_thread_pool" : {
          "queue_size" : "500",
          "size" : "80"
        },
        "max_anomaly_records" : "500",
        "enable_config_migration" : "true",
        "max_open_jobs" : "20",
        "min_disk_space_off_heap" : "5gb",
        "node_concurrent_job_allocations" : "2",
        "max_model_memory_limit" : "0b",
        "enabled" : "true",
        "max_lazy_ml_nodes" : "0",
        "max_machine_memory_percent" : "30",
        "autodetect_process" : "true",
        "datafeed_thread_pool" : {
          "queue_size" : "200",
          "size" : "20"
        },
        "process_connect_timeout" : "10s",
        "autodetect_thread_pool" : {
          "queue_size" : "80",
          "size" : "80"
        }
      }
    },
    "rest" : {
      "action" : {
        "multi" : {
          "allow_explicit_index" : "true"
        }
      }
    },
    "cache" : {
      "recycler" : {
        "page" : {
          "limit" : {
            "heap" : "10%"
          },
          "type" : "CONCURRENT",
          "weight" : {
            "longs" : "1.0",
            "ints" : "1.0",
            "bytes" : "1.0",
            "objects" : "0.1"
          }
        }
      }
    },
    "reindex" : {
      "remote" : {
        "whitelist" : [ ]
      }
    },
    "max" : {
      "anomaly" : {
        "records" : "500"
      }
    },
    "resource" : {
      "reload" : {
        "enabled" : "true",
        "interval" : {
          "low" : "60s",
          "high" : "5s",
          "medium" : "30s"
        }
      }
    },
    "thread_pool" : {
      "force_merge" : {
        "queue_size" : "-1",
        "size" : "1"
      },
      "fetch_shard_started" : {
        "core" : "1",
        "max" : "12",
        "keep_alive" : "5m"
      },
      "listener" : {
        "queue_size" : "-1",
        "size" : "3"
      },
      "index" : {
        "queue_size" : "200",
        "size" : "6"
      },
      "refresh" : {
        "core" : "1",
        "max" : "3",
        "keep_alive" : "5m"
      },
      "generic" : {
        "core" : "4",
        "max" : "128",
        "keep_alive" : "30s"
      },
      "warmer" : {
        "core" : "1",
        "max" : "3",
        "keep_alive" : "5m"
      },
      "search" : {
        "max_queue_size" : "1000",
        "queue_size" : "1000",
        "size" : "10",
        "auto_queue_frame_size" : "2000",
        "target_response_time" : "1s",
        "min_queue_size" : "1000"
      },
      "fetch_shard_store" : {
        "core" : "1",
        "max" : "12",
        "keep_alive" : "5m"
      },
      "flush" : {
        "core" : "1",
        "max" : "3",
        "keep_alive" : "5m"
      },
      "management" : {
        "core" : "1",
        "max" : "5",
        "keep_alive" : "5m"
      },
      "analyze" : {
        "queue_size" : "16",
        "size" : "1"
      },
      "get" : {
        "queue_size" : "1000",
        "size" : "6"
      },
      "bulk" : {
        "queue_size" : "200",
        "size" : "6"
      },
      "estimated_time_interval" : "200ms",
      "write" : {
        "queue_size" : "200",
        "size" : "6"
      },
      "snapshot" : {
        "core" : "1",
        "max" : "3",
        "keep_alive" : "5m"
      },
      "search_throttled" : {
        "max_queue_size" : "100",
        "queue_size" : "100",
        "size" : "1",
        "auto_queue_frame_size" : "200",
        "target_response_time" : "1s",
        "min_queue_size" : "100"
      }
    },
    "index" : {
      "codec" : "default",
      "store" : {
        "type" : "",
        "fs" : {
          "fs_lock" : "native"
        },
        "preload" : [ ]
      }
    },
    "monitor" : {
      "jvm" : {
        "gc" : {
          "enabled" : "true",
          "overhead" : {
            "warn" : "50",
            "debug" : "10",
            "info" : "25"
          },
          "refresh_interval" : "1s"
        },
        "refresh_interval" : "1s"
      },
      "process" : {
        "refresh_interval" : "1s"
      },
      "os" : {
        "refresh_interval" : "1s"
      },
      "fs" : {
        "refresh_interval" : "1s"
      }
    },
    "transport" : {
      "tcp" : {
        "reuse_address" : "true",
        "connect_timeout" : "30s",
        "compress" : "false",
        "port" : "9300-9400",
        "no_delay" : "true",
        "keep_alive" : "true",
        "receive_buffer_size" : "-1b",
        "send_buffer_size" : "-1b"
      },
      "bind_host" : [ ],
      "connect_timeout" : "30s",
      "compress" : "false",
      "ping_schedule" : "-1",
      "connections_per_node" : {
        "recovery" : "2",
        "state" : "1",
        "bulk" : "3",
        "reg" : "6",
        "ping" : "1"
      },
      "tracer" : {
        "include" : [ ],
        "exclude" : [
          "internal:discovery/zen/fd*",
          "cluster:monitor/nodes/liveness"
        ]
      },
      "type" : "security4",
      "type.default" : "netty4",
      "features" : {
        "x-pack" : "true"
      },
      "port" : "9300-9400",
      "host" : [ ],
      "publish_port" : "-1",
      "tcp_no_delay" : "true",
      "publish_host" : [ ],
      "netty" : {
        "receive_predictor_size" : "64kb",
        "receive_predictor_max" : "64kb",
        "worker_count" : "12",
        "receive_predictor_min" : "64kb",
        "boss_count" : "1"
      }
    },
    "script" : {
      "allowed_contexts" : [ ],
      "max_compilations_rate" : "75/5m",
      "cache" : {
        "max_size" : "100",
        "expire" : "0ms"
      },
      "painless" : {
        "regex" : {
          "enabled" : "false"
        }
      },
      "max_size_in_bytes" : "65535",
      "allowed_types" : [ ]
    },
    "node" : {
      "data" : "true",
      "enable_lucene_segment_infos_trace" : "false",
      "local_storage" : "true",
      "max_local_storage_nodes" : "1",
      "name" : "graylog-esnode1",
      "id" : {
        "seed" : "0"
      },
      "store" : {
        "allow_mmap" : "true",
        "allow_mmapfs" : "true"
      },
      "attr" : {
        "xpack" : {
          "installed" : "true"
        },
        "ml" : {
          "machine_memory" : "61165051904",
          "max_open_jobs" : "20",
          "enabled" : "true"
        }
      },
      "portsfile" : "false",
      "ingest" : "false",
      "master" : "false",
      "ml" : "true"
    },
    "indices" : {
      "cache" : {
        "cleanup_interval" : "1m"
      },
      "mapping" : {
        "dynamic_timeout" : "30s"
      },
      "memory" : {
        "interval" : "5s",
        "max_index_buffer_size" : "-1",
        "shard_inactive_time" : "5m",
        "index_buffer_size" : "10%",
        "min_index_buffer_size" : "48mb"
      },
      "breaker" : {
        "request" : {
          "type" : "memory",
          "overhead" : "1.0"
        },
        "total" : {
          "limit" : "70%"
        },
        "accounting" : {
          "overhead" : "1.0"
        },
        "fielddata" : {
          "type" : "memory",
          "overhead" : "1.03"
        },
        "type" : "hierarchy"
      },
      "query" : {
        "bool" : {
          "max_clause_count" : "1024"
        },
        "query_string" : {
          "analyze_wildcard" : "false",
          "allowLeadingWildcard" : "true"
        }
      },
      "admin" : {
        "filtered_fields" : "true"
      },
      "recovery" : {
        "recovery_activity_timeout" : "1800000ms",
        "retry_delay_network" : "5s",
        "internal_action_timeout" : "15m",
        "retry_delay_state_sync" : "500ms",
        "internal_action_long_timeout" : "1800000ms",
        "max_bytes_per_sec" : "150mb",
        "max_concurrent_file_chunks" : "1"
      },
      "requests" : {
        "cache" : {
          "size" : "1%",
          "expire" : "0ms"
        }
      },
      "store" : {
        "delete" : {
          "shard" : {
            "timeout" : "30s"
          }
        }
      },
      "analysis" : {
        "hunspell" : {
          "dictionary" : {
            "ignore_case" : "false",
            "lazy" : "false"
          }
        }
      },
      "queries" : {
        "cache" : {
          "count" : "10000",
          "size" : "10%",
          "all_segments" : "false"
        }
      },
      "lifecycle" : {
        "poll_interval" : "10m"
      },
      "fielddata" : {
        "cache" : {
          "size" : "60%"
        }
      }
    },
    "plugin" : {
      "mandatory" : [ ]
    },
    "max_running_jobs" : "20",
    "discovery" : {
      "type" : "zen",
      "zen" : {
        "commit_timeout" : "30s",
        "no_master_block" : "write",
        "join_retry_delay" : "100ms",
        "join_retry_attempts" : "3",
        "ping" : {
          "unicast" : {
            "concurrent_connects" : "10",
            "hosts" : [
              "graylog-esmaster1.swmcloud.net",
              "graylog-esmaster2.swmcloud.net",
              "graylog-esmaster3.swmcloud.net"
            ],
            "hosts.resolve_timeout" : "5s"
          }
        },
        "master_election" : {
          "ignore_non_master_pings" : "false",
          "wait_for_joins_timeout" : "30000ms"
        },
        "send_leave_request" : "true",
        "ping_timeout" : "3s",
        "join_timeout" : "60000ms",
        "publish_diff" : {
          "enable" : "true"
        },
        "publish" : {
          "max_pending_cluster_states" : "25"
        },
        "minimum_master_nodes" : "2",
        "hosts_provider" : [ ],
        "publish_timeout" : "30s",
        "fd" : {
          "connect_on_network_disconnect" : "false",
          "ping_interval" : "1s",
          "ping_retries" : "3",
          "register_connection_listener" : "true",
          "ping_timeout" : "30s"
        },
        "max_pings_from_another_master" : "3"
      },
      "initial_state_timeout" : "30s"
    },
    "tribe" : {
      "name" : "",
      "on_conflict" : "any",
      "blocks" : {
        "metadata" : "false",
        "read" : {
          "indices" : [ ]
        },
        "write.indices" : [ ],
        "write" : "false",
        "metadata.indices" : [ ]
      }
    },
    "http" : {
      "cors" : {
        "max-age" : "1728000",
        "allow-origin" : "",
        "allow-headers" : "X-Requested-With,Content-Type,Content-Length",
        "allow-credentials" : "false",
        "allow-methods" : "OPTIONS,HEAD,GET,POST,PUT,DELETE",
        "enabled" : "false"
      },
      "max_chunk_size" : "8kb",
      "compression_level" : "3",
      "max_initial_line_length" : "4kb",
      "type" : "security4",
      "pipelining" : "true",
      "enabled" : "true",
      "type.default" : "netty4",
      "content_type" : {
        "required" : "true"
      },
      "host" : [ ],
      "publish_port" : "-1",
      "read_timeout" : "0ms",
      "max_content_length" : "100mb",
      "netty" : {
        "receive_predictor_size" : "64kb",
        "max_composite_buffer_components" : "69905",
        "receive_predictor_max" : "64kb",
        "worker_count" : "12",
        "receive_predictor_min" : "64kb"
      },
      "tcp" : {
        "reuse_address" : "true",
        "keep_alive" : "true",
        "receive_buffer_size" : "-1b",
        "no_delay" : "true",
        "send_buffer_size" : "-1b"
      },
      "bind_host" : [ ],
      "reset_cookies" : "false",
      "max_warning_header_count" : "-1",
      "max_warning_header_size" : "-1b",
      "detailed_errors" : {
        "enabled" : "true"
      },
      "port" : "9200-9300",
      "max_header_size" : "8kb",
      "pipelining.max_events" : "10000",
      "tcp_no_delay" : "true",
      "compression" : "true",
      "publish_host" : [ ]
    },
    "gateway" : {
      "recover_after_master_nodes" : "0",
      "expected_nodes" : "-1",
      "recover_after_data_nodes" : "-1",
      "expected_data_nodes" : "-1",
      "recover_after_time" : "0ms",
      "expected_master_nodes" : "-1",
      "recover_after_nodes" : "-1"
    }
  }
}

3. What steps have you already taken to try and solve the problem?

Added Graylog nodes.
Created multiple indices (with the number of shards equal to the number of Elasticsearch nodes) to get an even shard distribution across the Elasticsearch cluster.
Created a mapping template to reduce indexer failures.

4. How can the community help?

Help us find the root cause of the uneven distribution of unprocessed messages.

Thanks for your help.
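
To put a number on "uneven", the per-node backlog can be compared against the cluster average. The sketch below is only an illustration: the node names and counts are made up, and it assumes the unprocessed count of each node has already been collected (for example from the journal figures shown on the Nodes page of the Graylog web interface).

```python
from statistics import mean


def journal_skew(unprocessed):
    """Summarise how unevenly unprocessed messages are spread across nodes.

    `unprocessed` maps a node name to its number of unprocessed messages.
    The names and numbers used in this example are hypothetical.
    """
    if not unprocessed:
        raise ValueError("need at least one node")
    avg = mean(unprocessed.values())
    busiest = max(unprocessed, key=unprocessed.get)
    # Ratio of the busiest node to the average; 1.0 means perfectly even.
    ratio = unprocessed[busiest] / avg if avg else 1.0
    return {"average": avg, "busiest": busiest, "ratio": ratio}


# Made-up sample: one node holds most of the backlog.
sample = {"graylog-1": 120, "graylog-2": 80, "graylog-3": 1000}
print(journal_skew(sample))
```

A ratio well above 1.0 on the same node over time would suggest that node receives more traffic than the others, rather than processing more slowly.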

gsmith · May 3, 2022, 1:00am

Hello && welcome

I have been looking over your configuration for a couple of days. Correct me if I’m wrong, but is this a 9-node cluster where each node runs Elasticsearch, Graylog, and MongoDB? I ask because in most Graylog setups with multiple Elasticsearch and MongoDB instances, these settings are configured like so:

elasticsearch_hosts = http://10.10.10.10:9200, http://10.10.10.11:9200, http://10.10.10.22:9200

mongodb_uri = mongodb://10.10.10.10:27017,10.10.10.11:27017,10.10.10.22:27017/graylog?replicaSet=replica01

What I’m not seeing in your Graylog configuration is the following setting, so I assume you made modifications.

elasticsearch_index_prefix = graylog

Which versions of the services do you have installed?
By any chance, do you have a load balancer in front of your cluster?
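
The two multi-host settings above follow different rules: `elasticsearch_hosts` is a comma-separated list of full URLs, while a MongoDB replica-set URI carries the `mongodb://` scheme once, followed by bare `host:port` pairs. A minimal sketch that builds both lines from a host list; the helper names are hypothetical, and the addresses, database name, and replica set name are the example values from this post.

```python
def elasticsearch_hosts(hosts, port=9200, scheme="http"):
    """Every Elasticsearch node is written as a full URL, comma-separated."""
    return ",".join(f"{scheme}://{host}:{port}" for host in hosts)


def mongodb_uri(hosts, database, replica_set, port=27017):
    """The mongodb:// scheme appears once; members are bare host:port pairs."""
    members = ",".join(f"{host}:{port}" for host in hosts)
    return f"mongodb://{members}/{database}?replicaSet={replica_set}"


# Example addresses taken from the settings shown above.
nodes = ["10.10.10.10", "10.10.10.11", "10.10.10.22"]
print("elasticsearch_hosts =", elasticsearch_hosts(nodes))
print("mongodb_uri =", mongodb_uri(nodes, "graylog", "replica01"))
```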

teddy · May 3, 2022, 8:38am

Hello gsmith,

Thank you for your help.

To list the Elasticsearch and MongoDB nodes, we use HashiCorp Consul, which gives us dynamic configuration and node discovery.

elasticsearch_hosts = http://GRAYLOG-elasticsearch.service.swmconsul:9200

mongodb_uri = mongodb://GRAYLOG-mongodb.service.swmconsul:27017/admin?replicaSet=graylog

Graylog version 4.2.8
Elasticsearch version: 6.8
Mongodb version: 4.2.7

For the load balancer, we have Avi Vantage, which uses node discovery through Consul.

After your comment, we analysed the load balancer and now think that Consul discovery was not doing its job: instead of our 9 Graylog nodes, Consul had discovered only 3.
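A check like the one described here could be sketched roughly as follows: resolve the service name and compare the returned addresses with the nodes that should be there. This is only an illustration. The function names and the example addresses are hypothetical, and the service name to query depends on how the Graylog nodes are registered in Consul.

```python
import socket


def resolve_addresses(hostname):
    """Return the sorted, de-duplicated IPv4 addresses a DNS name resolves to."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return sorted(set(addresses))


def missing_nodes(expected, discovered):
    """Return the expected node addresses that are absent from the discovered ones."""
    return sorted(set(expected) - set(discovered))


def report(hostname, expected):
    """Print which expected nodes the DNS name does not currently return."""
    discovered = resolve_addresses(hostname)
    missing = missing_nodes(expected, discovered)
    print(f"{hostname}: {len(discovered)} discovered, {len(missing)} missing")
    for address in missing:
        print(f"  missing: {address}")


# Example call (hypothetical service name and addresses, not taken from the thread):
# report("my-graylog.service.example", ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
```

Running it several times is worthwhile, since a discovery service may return a different subset of nodes on each query.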

Consul discovery now works, but the uneven distribution persists :(.

We have 9 Graylog nodes and 7 ES nodes. What did we configure wrong?

Thank you very much for your help.

gsmith · May 3, 2022, 11:28pm

Hello,

My apologies, I haven’t used Avi Vantage or HashiCorp products, so I probably won’t be much help troubleshooting them.

Since you’re using other software, I’m not 100% sure where your issue could be. I can show you a mockup of what I did, though. I understand every environment is different, but some common configuration is needed. The settings below are for Elasticsearch 5.x; I also used them on 6.x. I can’t remember exactly, but I think the configuration file changed somewhat in 6.x.

Example:
For each node, notice how I configured the configuration file: 3 master nodes and 7 data nodes.

sudo vim /etc/elasticsearch/elasticsearch.yml

cluster.name: graylog
network.host: 10.10.10.10
http.port: 9200
node.name: lab-elastic-001
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["10.10.10.11","10.10.10.12","10.10.10.13","10.10.10.14","10.10.10.15","10.10.10.16"]
discovery.zen.minimum_master_nodes: 3

cluster.name: graylog
network.host: 10.10.10.11
http.port: 9200
node.name: lab-elastic-002
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["10.10.10.10","10.10.10.12","10.10.10.13","10.10.10.14","10.10.10.15","10.10.10.16"]
discovery.zen.minimum_master_nodes: 3

cluster.name: graylog
network.host: 10.10.10.12
http.port: 9200
node.name: lab-elastic-003
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["10.10.10.10","10.10.10.11","10.10.10.13","10.10.10.14","10.10.10.15","10.10.10.16"]
discovery.zen.minimum_master_nodes: 3

cluster.name: graylog
network.host: 10.10.10.13
http.port: 9200
node.name: lab-elastic-004
node.master: true
node.data: true
discovery.zen.ping.unicast.hosts: ["10.10.10.10","10.10.10.11","10.10.10.12","10.10.10.14","10.10.10.15","10.10.10.16"]
discovery.zen.minimum_master_nodes: 3

cluster.name: graylog
network.host: 10.10.10.14
http.port: 9200
node.name: lab-elastic-005
node.master: false
node.data: true
discovery.zen.ping.unicast.hosts: ["10.10.10.10","10.10.10.11","10.10.10.12","10.10.10.13","10.10.10.15","10.10.10.16"]
discovery.zen.minimum_master_nodes: 3


cluster.name: graylog
network.host: 10.10.10.15
http.port: 9200
node.name: lab-elastic-006
node.master: false
node.data: true
discovery.zen.ping.unicast.hosts: ["10.10.10.10","10.10.10.11","10.10.10.12","10.10.10.13","10.10.10.14","10.10.10.16"]
discovery.zen.minimum_master_nodes: 3

cluster.name: graylog
network.host: 10.10.10.16
http.port: 9200
node.name: lab-elastic-007
node.master: false
node.data: true
discovery.zen.ping.unicast.hosts: ["10.10.10.10","10.10.10.11","10.10.10.12","10.10.10.13","10.10.10.14","10.10.10.15"]
discovery.zen.minimum_master_nodes: 3

To sum it up:

3 Master nodes
7 Data nodes
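
For reference, the value of discovery.zen.minimum_master_nodes in Elasticsearch 6.x and earlier is normally derived from the number of master-eligible nodes with the quorum rule (master_eligible / 2) + 1. Note that in the files above four nodes (lab-elastic-001 to lab-elastic-004) set node.master: true, which is consistent with the value 3. A minimal sketch of that rule, with a hypothetical function name:

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: floor(n / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("at least one master-eligible node is required")
    return master_eligible // 2 + 1


# Print the quorum for a few cluster sizes.
for count in (3, 4, 5):
    print(count, "master-eligible nodes -> minimum_master_nodes =", minimum_master_nodes(count))
```

With exactly three master-eligible nodes the same rule gives 2, so it is worth checking the setting against the real number of nodes that have node.master: true.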

My Graylog configuration file corresponds to the master nodes configured above.

elasticsearch_hosts = http://lab-elastic-001:9200, http://lab-elastic-002:9200, http://lab-elastic-003:9200, http://lab-elastic-004:9200

I have used nginx as a load balancer, but I haven’t had this issue with uneven distribution.
Not sure if that helps.

gsmith · May 3, 2022, 11:39pm

Second part of my example:
This was for a three node MongoDb cluster

Only one instance runs as ‘PRIMARY’, all other instances are ‘SECONDARY’.
Data is written only on the ‘PRIMARY’ instance, the data sets are then replicated to all ‘SECONDARY’ instances.

Execute these commands:

shell# mongo
> rs.initiate()

Add the 'lab-graylog-002' and 'lab-graylog-003' nodes to the replica set:

> rs.add("lab-graylog-002")
> rs.add("lab-graylog-003")

Check the replica set status:

> rs.status()

Check the status on 'lab-graylog-001':

> rs.isMaster()  // should show "ismaster" : true

NOTE: Enable reading from a 'SECONDARY' node with 'rs.slaveOk()'.

My Graylog configuration file corresponds to those settings:

mongodb_uri = mongodb://lab-graylog-001:27017,lab-graylog-002:27017,lab-graylog-003:27017/graylog?replicaSet=replica01
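
As an illustration of the shape of that setting, here is a small sketch that assembles a replica set URI from a host list: one mongodb:// scheme, comma-separated host:port pairs, then the database and the replicaSet option. The helper name is hypothetical and it is not part of Graylog or MongoDB.

```python
def build_mongodb_uri(hosts, database, replica_set, port=27017):
    """Build a replica set connection URI with a single scheme prefix."""
    if not hosts:
        raise ValueError("at least one host is required")
    host_list = ",".join(f"{host.strip()}:{port}" for host in hosts)
    return f"mongodb://{host_list}/{database}?replicaSet={replica_set}"


uri = build_mongodb_uri(
    ["lab-graylog-001", "lab-graylog-002", "lab-graylog-003"],
    database="graylog",
    replica_set="replica01",
)
print("mongodb_uri =", uri)
```

The strip() call guards against stray spaces in host names, which are easy to introduce when copying them by hand.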

I took most of these suggestions from the documentation here:

Multi-node Setup - Configuring Graylog

Hope that helps

teddy · May 4, 2022, 2:08pm

Hello,

Thank you very very much gsmith.

We have identified the issue:
the GELF UDP input sends messages to only 3 nodes instead of 9.

We have also corrected a problem with the Avi load balancer, and now we have a more even distribution of messages for all inputs except GELF UDP.
But this raised a new problem: indexer failures with this message:

rejected execution of processing of [425888633][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[tomcat__109][7]] containing [53] requests, target allocation id: 2Sv0bIIcR7WB9mkEZ8ZuVg, primary term: 1 on EsThreadPoolExecutor[name = graylog-esnode3/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4596eefa[Running, pool size = 6, active threads = 6, queued tasks = 200, completed tasks = 412462213]]
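
The message above says that the write thread pool on graylog-esnode3 had all 6 threads busy and its queue of 200 full, so the bulk request was rejected. A minimal sketch for pulling those numbers out of such a message, assuming the format shown above (the function name and the shortened sample are illustrative):

```python
import re

# Matches the thread pool details at the end of a rejection message.
PATTERN = re.compile(
    r"name = (?P<pool>[^,]+), queue capacity = (?P<capacity>\d+).*?"
    r"pool size = (?P<pool_size>\d+), active threads = (?P<active>\d+), "
    r"queued tasks = (?P<queued>\d+)"
)

# Shortened copy of the message from the thread.
SAMPLE = (
    "rejected execution of processing of [425888633][indices:data/write/bulk[s][p]] "
    "on EsThreadPoolExecutor[name = graylog-esnode3/write, queue capacity = 200, "
    "org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@4596eefa[Running, "
    "pool size = 6, active threads = 6, queued tasks = 200, completed tasks = 412462213]]"
)


def parse_rejection(message):
    """Return thread pool figures from a rejection message, or None if it does not match."""
    match = PATTERN.search(message)
    if match is None:
        return None
    info = {key: int(value) for key, value in match.groupdict().items() if key != "pool"}
    info["pool"] = match.group("pool")
    # Saturated: every thread is busy and the queue has no free slot.
    info["saturated"] = info["active"] >= info["pool_size"] and info["queued"] >= info["capacity"]
    return info


print(parse_rejection(SAMPLE))
```

Counting rejections per pool name this way can show whether a single ES node is overloaded or the whole cluster is.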

I found this topic:

So now I will try to fix the GELF UDP input and the indexer failures.
If you have any tips, I’ll be happy to read them.

Thank you very much gsmith for your time.

gsmith · May 4, 2022, 10:02pm

Hello @teddy

The only thing that comes to mind is to raise output_batch_size to 1000 or 2000, raise outputbuffer_processors to 5, and set your index refresh interval to 30 seconds in Elasticsearch.

EDIT:

Here is an example from my Graylog server. I ingest about 30-40 GB of messages a day, and this is just one node. I had the same issue a while back. The settings below reflect what I stated above.

elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = false
elasticsearch_analyzer = standard
# Raise
output_batch_size = 2000
output_flush_interval = 1
output_fault_count_threshold = 5
inputbuffer_processors = 2
processbuffer_processors = 8
# Increase
outputbuffer_processors = 5

In your index set “Edit” configurations

Of course, after editing your GL configuration file you need to restart the service.
Ensure all the GL nodes have these settings.
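
A rough way to reason about these two settings, under the assumption (not confirmed in this thread) that each output buffer processor on each Graylog node sends at most one bulk request at a time. The function names are hypothetical; the 9 nodes and 2 processors come from the configuration posted earlier in the thread.

```python
def concurrent_bulk_requests(graylog_nodes, outputbuffer_processors):
    """Assumed upper bound on bulk requests reaching Elasticsearch at once."""
    return graylog_nodes * outputbuffer_processors


def messages_in_flight(graylog_nodes, outputbuffer_processors, output_batch_size):
    """Assumed upper bound on messages contained in those bulk requests."""
    return concurrent_bulk_requests(graylog_nodes, outputbuffer_processors) * output_batch_size


# 9 Graylog nodes with 2 output buffer processors and a batch size of 2000.
print(concurrent_bulk_requests(9, 2))
print(messages_in_flight(9, 2, 2000))

# The same cluster with 5 output buffer processors per node.
print(concurrent_bulk_requests(9, 5))
```

Under this model a larger batch size means fewer, bigger requests, while more processors mean more simultaneous requests competing for the Elasticsearch write queue, so the two settings are best changed one at a time.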

Hope that helps

teddy · May 5, 2022, 10:37am

Hello @gsmith,

I have increased output_batch_size to 2000, but as for the other parameters mentioned, I can’t add more processors because we lack the resources. Right now Graylog works very well; I will wait a few days to analyse and adjust the parameters.

Thank you for your help and your time.

Have a nice day

gsmith · May 5, 2022, 9:11pm

Sounds good. Keep us updated.

system · May 19, 2022, 9:11pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
