Process buffer and output buffer are full. Journal over the allowed size. No messages written to Elasticsearch

  1. Describe your incident:
    I'm running 4 Graylog 5.0.5 nodes behind a load balancer and 5 Elasticsearch instances. The four Graylog nodes are not sending messages to Elasticsearch, and the backlog of messages is saturating the journal. In server.log there are many of these error messages:
    ERROR [Messages] Caught exception during bulk indexing: ElasticsearchException{message=ElasticsearchException[An error occurred: ]; nested: IOException[Connection reset]; nested: SocketException[Connection reset];, errorDetails=}, retrying (attempt #38)

All hosts of the Elasticsearch and Graylog clusters are on the same subnet and nothing prevents communication between them; even when testing manually there do not seem to be any problems.
The infrastructure had been running smoothly for months when this problem suddenly appeared.
On the Elasticsearch side, by contrast, there are no evident issues and resource consumption is low.
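For the manual connectivity test mentioned above, a quick check like the following can be run from each Graylog node (host1–host5 are the Elasticsearch hosts from the configuration below; plain HTTP is assumed because the security plugin's HTTP TLS is disabled):

    # From a Graylog node: confirm every Elasticsearch host answers on port 9200
    for host in host1 host2 host3 host4 host5; do
      curl -s -o /dev/null -w "$host -> HTTP %{http_code}\n" "http://$host:9200"
    done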

  2. Describe your environment:
    Graylog nodes details:
    CPU
    Architecture: x86_64
    CPU op-mode(s): 32-bit, 64-bit
    Byte Order: Little Endian
    CPU(s): 16
    On-line CPU(s) list: 0-15
    Thread(s) per core: 2
    Core(s) per socket: 8
    Socket(s): 1
    NUMA node(s): 1
    Vendor ID: AuthenticAMD
    CPU family: 25
    Model: 1
    Model name: AMD EPYC 7J13 64-Core Processor
    Stepping: 1
    CPU MHz: 2445.392
    BogoMIPS: 4890.78
    Virtualization: AMD-V
    Hypervisor vendor: KVM
    Virtualization type: full
    L1d cache: 64K
    L1i cache: 64K
    L2 cache: 512K
    L3 cache: 16384K
    NUMA node0 CPU(s): 0-15

OS version
NAME="Oracle Linux Server"
VERSION="8.8"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:8:server"

Elasticsearch Version: 7.10.2
Graylog Version: 5.0.5

GRAYLOG graylog.conf
is_leader = true
node_id_file = /appl/graylog/graylog-5.0.5-linux-x64/node_id
password_secret = xxxx
root_password_sha2 = xxxx
root_timezone = America/Sao_Paulo
bin_dir = /appl/graylog/graylog-5.0.5-linux-x64/bin
data_dir = /appl/graylog/graylog-5.0.5-linux-x64/data
plugin_dir = /appl/graylog/graylog-5.0.5-linux-x64/plugin
http_bind_address = node_ip_addr:9000
http_enable_tls = true
http_tls_cert_file = /appl/graylog/cert.pem
http_tls_key_file = /appl/graylog/pkcs8-encrypted.pem
http_tls_key_password = pwd
stream_aware_field_types=false
elasticsearch_hosts = host1:9200,host2:9200,host3:9200,host4:9200,host5:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 3
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 1000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 7
outputbuffer_processors = 4
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 3
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = data/journal
message_journal_max_age = 24h
message_journal_max_size = 10gb
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://host1:27017,host2:27017,host3:27017,host4:27017/graylog?replicaSet=rs0
mongodb_max_connections = 1000
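
With message_journal_max_size = 10gb, the per-node journal fill level can also be read from the Graylog REST API. A minimal check, assuming an admin account (admin:password and node_ip_addr are placeholders; -k skips verification of the self-managed TLS certificate):

    # Journal utilization and oldest-segment age on this Graylog node
    curl -sk -u admin:password "https://node_ip_addr:9000/api/system/journal"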

ELASTICSEARCH CONFIG
cluster.name: elasticsearch_cluster
node.name: hostname
node.attr.rack: r1
path.data: /appl/elasticsearch/elasticsearch-7.10.2/data
path.logs: /appl/elasticsearch/elasticsearch-7.10.2/log
network.host: ip_add
discovery.seed_hosts: ["host1", "host2", "host3", "host4", "host5"]
cluster.initial_master_nodes: ["host1-1", "host2-2", "host3-3"]
gateway.recover_after_nodes: 2
action.auto_create_index: false

opendistro_security.ssl.transport.pemcert_filepath: esnode.pem
opendistro_security.ssl.transport.pemkey_filepath: esnode-key.pem
opendistro_security.ssl.transport.pemtrustedcas_filepath: root-ca.pem
opendistro_security.ssl.transport.enforce_hostname_verification: false
opendistro_security.ssl.http.enabled: false
opendistro_security.ssl.http.pemcert_filepath: esnode.pem
opendistro_security.ssl.http.pemkey_filepath: esnode-key.pem
opendistro_security.ssl.http.pemtrustedcas_filepath: root-ca.pem
opendistro_security.allow_unsafe_democertificates: true
opendistro_security.allow_default_init_securityindex: true
opendistro_security.authcz.admin_dn:
  - CN=kirk,OU=client,O=client,L=test, C=de
#opendistro_security.audit.type: internal_elasticsearch
opendistro_security.enable_snapshot_restore_privilege: true
opendistro_security.check_snapshot_restore_write_privileges: true
opendistro_security.restapi.roles_enabled: ["all_access", "security_rest_api_access", "readall"]
opendistro_security.system_indices.enabled: true
opendistro_security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*"]
cluster.routing.allocation.disk.threshold_enabled: false
node.max_local_storage_nodes: 5

  3. What steps have you already taken to try and solve the problem?
    I tried to raise the number of processors allocated (processbuffer_processors = 7, outputbuffer_processors = 4, inputbuffer_processors = 3), but nothing changed. I restarted the Graylog services several times; that seemed to help at first, but now the situation appears stuck.
    I also tried to increase the field type refresh interval from 5 s to 30 s; a way to inspect what the blocked buffers are doing is sketched below.
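    To see what a full process buffer is actually working on, Graylog can produce a process-buffer dump per node. A minimal sketch, assuming the /api/system/processbufferdump endpoint of recent Graylog versions (credentials, node address, and the exact path are assumptions here):

      # Ask this node which message each process-buffer processor is currently handling
      curl -sk -u admin:password "https://node_ip_addr:9000/api/system/processbufferdump"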

Any idea what I should troubleshoot in order to solve this issue?

Hello @chimi

Does the cluster health API call return anything of interest?
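For example, something along these lines against one of the Elasticsearch hosts (host1 is a placeholder, and plain HTTP is assumed since the security plugin's HTTP layer is disabled in the config above):

    # Cluster-wide health: status plus relocating/initializing/unassigned shard counts
    curl -s "http://host1:9200/_cluster/health?pretty"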

Are you sure there is nothing in any of the logs on your OpenSearch hosts?

Hello @Wine_Merchant
There are no useful logs on the Elasticsearch hosts, and the cluster health API always returns a green status and nothing else of interest.
The problem seems to be in the communication between Graylog and Elasticsearch, on the Graylog side.