- Describe your incident:
I’m running four Graylog 5.0.5 nodes behind a load balancer and five Elasticsearch instances. The four Graylog nodes are not sending messages to Elasticsearch, and the backlog of messages is saturating the journal. server.log is full of error messages like this one:
ERROR [Messages] Caught exception during bulk indexing: ElasticsearchException{message=ElasticsearchException[An error occurred: ]; nested: IOException[Connection reset]; nested: SocketException[Connection reset];, errorDetails=}, retrying (attempt #38)
All hosts of the Elasticsearch and Graylog clusters are on the same subnet and there is nothing that should prevent communication between them; even when testing connectivity manually there do not seem to be any problems (a sketch of those checks is below).
The infrastructure had been running smoothly for months when this problem suddenly appeared.
On the Elasticsearch side, by contrast, there are no evident issues and resource consumption is low.
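For completeness, the manual connectivity tests I mean are along these lines (a rough sketch; the host names are the same placeholders used in the configs below, and Elasticsearch is plain HTTP here because opendistro_security.ssl.http.enabled is false):

# Cluster reachability and health from one of the Graylog nodes
curl -s 'http://host1:9200/_cluster/health?pretty'

# Confirm every data node answers on the HTTP port Graylog uses
for h in host1 host2 host3 host4 host5; do curl -s -o /dev/null -w "$h %{http_code}\n" "http://$h:9200"; done

These all complete without errors from every Graylog node.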
- Describe your environment:
Graylog nodes details:
CPU
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7J13 64-Core Processor
Stepping: 1
CPU MHz: 2445.392
BogoMIPS: 4890.78
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 64K
L1i cache: 64K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-15
OS version
NAME="Oracle Linux Server"
VERSION="8.8"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:8:server"
Elasticsearch Version: 7.10.2
Graylog Version: 5.0.5
GRAYLOG graylog.conf
is_leader = true
node_id_file = /appl/graylog/graylog-5.0.5-linux-x64/node_id
password_secret = xxxx
root_password_sha2 = xxxx
root_timezone = America/Sao_Paulo
bin_dir = /appl/graylog/graylog-5.0.5-linux-x64/bin
data_dir = /appl/graylog/graylog-5.0.5-linux-x64/data
plugin_dir = /appl/graylog/graylog-5.0.5-linux-x64/plugin
http_bind_address = node_ip_addr:9000
http_enable_tls = true
http_tls_cert_file = /appl/graylog/cert.pem
http_tls_key_file = /appl/graylog/pkcs8-encrypted.pem
http_tls_key_password = pwd
stream_aware_field_types=false
elasticsearch_hosts = host1:9200,host2:9200,host3:9200,host4:9200,host5:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 3
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 1000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 7
outputbuffer_processors = 4
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 3
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = data/journal
message_journal_max_age = 24h
message_journal_max_size = 10gb
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://host1:27017,host2:27017,host3:27017,host4:27017/graylog?replicaSet=rs0
mongodb_max_connections = 1000
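In case it helps, journal utilization can also be checked per node through the Graylog REST API (a sketch; the credentials and node address are placeholders, and I’m assuming the standard system/journal and indexer-failures endpoints, so adjust the paths if your version differs):

# Journal utilization on this node (uncommitted entries, journal size, etc.)
curl -k -u admin:password 'https://node_ip_addr:9000/api/system/journal?pretty=true'

# Indexing failures recorded by Graylog itself
curl -k -u admin:password 'https://node_ip_addr:9000/api/system/indexer/failures?limit=10&offset=0&pretty=true'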
ELASTICSEARCH CONFIG
cluster.name: elasticsearch_cluster
node.name: hostname
node.attr.rack: r1
path.data: /appl/elasticsearch/elasticsearch-7.10.2/data
path.logs: /appl/elasticsearch/elasticsearch-7.10.2/log
network.host: ip_add
discovery.seed_hosts: ["host1", "host2", "host3", "host4", "host5"]
cluster.initial_master_nodes: ["host1-1", "host2-2", "host3-3"]
gateway.recover_after_nodes: 2
action.auto_create_index: false
opendistro_security.ssl.transport.pemcert_filepath: esnode.pem
opendistro_security.ssl.transport.pemkey_filepath: esnode-key.pem
opendistro_security.ssl.transport.pemtrustedcas_filepath: root-ca.pem
opendistro_security.ssl.transport.enforce_hostname_verification: false
opendistro_security.ssl.http.enabled: false
opendistro_security.ssl.http.pemcert_filepath: esnode.pem
opendistro_security.ssl.http.pemkey_filepath: esnode-key.pem
opendistro_security.ssl.http.pemtrustedcas_filepath: root-ca.pem
opendistro_security.allow_unsafe_democertificates: true
opendistro_security.allow_default_init_securityindex: true
opendistro_security.authcz.admin_dn:
  - CN=kirk,OU=client,O=client,L=test, C=de
#opendistro_security.audit.type: internal_elasticsearch
opendistro_security.enable_snapshot_restore_privilege: true
opendistro_security.check_snapshot_restore_write_privileges: true
opendistro_security.restapi.roles_enabled: ["all_access", "security_rest_api_access", "readall"]
opendistro_security.system_indices.enabled: true
opendistro_security.system_indices.indices: [".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opendistro-notifications-*", ".opendistro-notebooks", ".opendistro-asynchronous-search-response*"]
cluster.routing.allocation.disk.threshold_enabled: false
node.max_local_storage_nodes: 5
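On the Elasticsearch side, these are the kinds of checks I would expect to surface a problem (a sketch using standard 7.x APIs; host1 is again a placeholder):

# Write thread pool activity and rejections per node
curl -s 'http://host1:9200/_cat/thread_pool/write?v&h=node_name,active,queue,rejected'

# Circuit breaker statistics (tripped counts per breaker)
curl -s 'http://host1:9200/_nodes/stats/breaker?pretty'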
- What steps have you already taken to try and solve the problem?
I tried raising the number of buffer processors (processbuffer_processors = 7, outputbuffer_processors = 4, inputbuffer_processors = 3), but nothing changed. I have restarted the Graylog services several times; each restart seemed to help at first, but now the situation appears stuck.
I also tried increasing the field type refresh interval from 5 s to 30 s.
Any ideas on what to troubleshoot in order to solve this issue?