Buffer full - Connection refused on 9200

I have Graylog and Elasticsearch running on the same machine. The issue is that the process buffer is full. I've checked the logs and here's what I've found:

2020-07-23T15:51:27.751-04:00 ERROR [IndexFieldTypePollerPeriodical] Couldn't update field types for index set <Default index set/5f172f0e8b94001e849b6411>
org.graylog2.indexer.ElasticsearchException: Couldn't collect indices for alias graylog_deflector
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:54) ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:65) ~[graylog.jar:?]
        at org.graylog2.indexer.indices.Indices.aliasTarget(Indices.java:336) ~[graylog.jar:?]
        at org.graylog2.indexer.MongoIndexSet.getActiveWriteIndex(MongoIndexSet.java:204) ~[graylog.jar:?]
        at org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical.lambda$schedule$4(IndexFieldTypePollerPeriodical.java:201) ~[graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_252]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_252]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_252]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_252]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: io.searchbox.client.config.exception.CouldNotConnectException: Could not connect to http://127.0.0.1:9200
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:80) ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:49) ~[graylog.jar:?]
        ... 11 more
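For what it's worth, the call that fails there is only looking up which index the graylog_deflector alias points at; the alias itself can be checked by hand with the stock Elasticsearch 6.x _cat API (it prints one line per alias target, and nothing at all if the alias is missing):

curl -s "http://localhost:9200/_cat/aliases/graylog_deflector?v"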

I can successfully curl the Elasticsearch instance running on the same machine:

curl http://127.0.0.1:9200
{
  "name" : "uF7RBi6",
  "cluster_name" : "graylog",
  "cluster_uuid" : "bY1zhhyRSS-aNR6IHH49BQ",
  "version" : {
    "number" : "6.8.10",
    "build_flavor" : "oss",
    "build_type" : "deb",
    "build_hash" : "537cb22",
    "build_date" : "2020-05-28T14:47:19.882936Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.3",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
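Since the error comes from Graylog's own client rather than from curl, it can also help to compare with what Graylog itself reports about the indexer connection. Something like this against the Graylog REST API should show it (the path is from memory for 3.x, so confirm it in the API browser reachable from System > Nodes; the credentials are placeholders):

curl -u admin:yourpassword -H 'Accept: application/json' "http://127.0.0.1:9000/api/system/indexer/cluster/health?pretty=true"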

There is nothing in my Elasticsearch logs.

Here are my settings in server.conf (everything else is default):

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = redacted
root_password_sha2 = redacted
root_email = "admin@company.com"
root_timezone = America/Toronto
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 0.0.0.0:9000
trusted_proxies = 127.0.0.1/32, 0:0:0:0:0:0:0:1/128
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://localhost/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = mail.company.com
transport_email_port = 25
transport_email_use_auth = false
transport_email_subject_prefix = [Graylog]
transport_email_from_email = graylog@servers.company.com
proxied_requests_thread_pool_size = 32
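Note that there is no elasticsearch_hosts entry above, so Graylog should be falling back to its default of http://127.0.0.1:9200. A quick grep makes sure nothing overrides it (path assumes the standard package install):

grep -E '^elasticsearch_hosts' /etc/graylog/server/server.conf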

Everything in the Elasticsearch config is the default except for this:

cluster.name: graylog
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
action.auto_create_index: false

Netstat:

netstat -tunapl | grep 9200
tcp6       0      0 127.0.0.1:9200          :::*                    LISTEN      10342/java
tcp6       0      0 127.0.0.1:51814         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51814         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51808         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:51806         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:51812         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51806         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:51808         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51816         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51810         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:51816         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:51810         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51812         ESTABLISHED 10342/java

/etc/hosts

127.0.1.1 dev-graylog-1n1 dev-graylog-1n1
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Is there anything else I can check to help me debug? Graylog is behind an external nginx reverse proxy, but the web UI works fine. Could that be related?

I'm using Ubuntu 18.04 and installed Graylog using the official doc here: https://docs.graylog.org/en/3.3/pages/installation/os/ubuntu.html

Any ideas?
Thanks!

How many indices and shards does your Elasticsearch cluster have?

I'm using the default Elasticsearch config. Here's some output:

curl -X GET "http://localhost:9200/_cluster/health?pretty=true"
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 12,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

curl -X GET "http://localhost:9200/_cat/indices"
green open gl-events_0        hvpLLI6FRsKfIIB9FqK4VA 4 0     16 0 111.9kb 111.9kb
green open gl-system-events_0 kmGWAKuxQoWdw2x2g9G4Og 4 0      0 0     1kb     1kb
green open graylog_0          VP_DjYn2QxOT8YGnitoaFA 4 0 106960 0  43.9mb  43.9mb
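To count shards directly instead of adding up the columns, the stock 6.x _cat endpoint prints one line per shard:

curl -s "http://localhost:9200/_cat/shards" | wc -l

With the three indices above (4 primaries each, 0 replicas) that should print 12, matching active_shards in the cluster health output.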

So I'm assuming 3 indices with 12 shards total?
The server is currently not in production, and only 2 servers send their logs to this instance.
Specs are 8 CPUs with 8 GB of RAM.
Thanks

Hey @camaer

What I have seen in the first post indicates that something is wrong, but it is nothing generic, so you will need to play Sherlock yourself to find the reason. I currently have no idea why this might be happening to you.

Is there any other place I should look besides the Graylog and Elasticsearch logs? I've already searched around but to no avail, so this post was sort of my last resort :frowning:

After some more investigation, it doesn't seem to be related to the connection to Elasticsearch. Maybe the error I got happened while I was messing around with the config and restarting Elasticsearch.
That being said, when I do a process-buffer dump I can see some messages that seem stuck. Is there a way to delete those messages, or maybe get more info about why they are stuck? This could explain the high CPU usage.

Thanks

You have the option, per node, to get a process-buffer dump or a thread dump. Both can help identify problems.
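If you prefer the CLI, the same dumps should also be available through the REST API, roughly like this on 3.x (paths from memory, so double-check them in the API browser; node ID and credentials are placeholders):

curl -u admin:yourpassword -H 'Accept: application/json' "http://127.0.0.1:9000/api/cluster/<node-id>/processbufferdump?pretty=true"
curl -u admin:yourpassword -H 'Accept: application/json' "http://127.0.0.1:9000/api/cluster/<node-id>/threaddump?pretty=true"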


The node details page might give you some additional insights into the buffers and the current state.

I can clearly see that some messages are holding up the queue. Here's the process-buffer dump:

{
  "ProcessBufferProcessor #0": "source: webapp-1n1 | message: www: ~~utc:=[1595864127]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b5f0e953-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:27.258-04:00 }",
  "ProcessBufferProcessor #1": "source: webapp-1n1 | message: www: ~~utc:=[1595864131]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b64c4fc0-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:31.496-04:00 }",
  "ProcessBufferProcessor #2": "source: webapp-1n1 | message: www: ~~utc:=[1595864135]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b67beb43-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:35.766-04:00 }",
  "ProcessBufferProcessor #3": "source: webapp-1n1 | message: www: ~~utc:=[1595864144]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b71210c1-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:44.324-04:00 }",
  "ProcessBufferProcessor #4": "source: webapp-1n1 | message: www: ~~utc:=[1595864140]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b6a93cd0-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:40.112-04:00 }"
}

and the thread dump:

I'm not really familiar with Graylog, so I don't really know how to debug this. How can I figure out what in those messages causes the hang? Is there a way to remove the problematic messages from the queue?

Thanks!

Any ideas? I'm currently at 3.5M messages in the journal :confused:
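The journal state can also be polled from the REST API, roughly like this (the /api/system/journal path is what recent versions seem to expose; credentials are placeholders):

curl -u admin:yourpassword -H 'Accept: application/json' "http://127.0.0.1:9000/api/system/journal?pretty=true"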

Thanks!

Bump. Still looking for some information on how I can debug a stuck message.
