Buffer full - Connection refused on 9200

I have Graylog and Elasticsearch running on the same machine. The issue is that the process buffer is full. I've checked the logs and here's what I've found:

2020-07-23T15:51:27.751-04:00 ERROR [IndexFieldTypePollerPeriodical] Couldn't update field types for index set <Default index set/5f172f0e8b94001e849b6411>
org.graylog2.indexer.ElasticsearchException: Couldn't collect indices for alias graylog_deflector
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:54) ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:65) ~[graylog.jar:?]
        at org.graylog2.indexer.indices.Indices.aliasTarget(Indices.java:336) ~[graylog.jar:?]
        at org.graylog2.indexer.MongoIndexSet.getActiveWriteIndex(MongoIndexSet.java:204) ~[graylog.jar:?]
        at org.graylog2.indexer.fieldtypes.IndexFieldTypePollerPeriodical.lambda$schedule$4(IndexFieldTypePollerPeriodical.java:201) ~[graylog.jar:?]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_252]
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_252]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_252]
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_252]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
Caused by: io.searchbox.client.config.exception.CouldNotConnectException: Could not connect to http://127.0.0.1:9200
        at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:80) ~[graylog.jar:?]
        at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:49) ~[graylog.jar:?]
        ... 11 more
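For what it's worth, the call that fails there is only looking up which index the graylog_deflector alias points at; the alias itself can be checked by hand with the stock Elasticsearch 6.x _cat API (it prints one line per alias target, and nothing at all if the alias is missing):

curl -s "http://localhost:9200/_cat/aliases/graylog_deflector?v"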

I can successfully curl the Elasticsearch instance running on the same machine:

curl http://127.0.0.1:9200
{
  "name" : "uF7RBi6",
  "cluster_name" : "graylog",
  "cluster_uuid" : "bY1zhhyRSS-aNR6IHH49BQ",
  "version" : {
    "number" : "6.8.10",
    "build_flavor" : "oss",
    "build_type" : "deb",
    "build_hash" : "537cb22",
    "build_date" : "2020-05-28T14:47:19.882936Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.3",
    "minimum_wire_compatibility_version" : "5.6.0",
    "minimum_index_compatibility_version" : "5.0.0"
  },
  "tagline" : "You Know, for Search"
}
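Since the error comes from Graylog's own client rather than from curl, it can also help to compare with what Graylog itself reports about the indexer connection. Something like this against the Graylog REST API should show it (the path is from memory for 3.x, so confirm it in the API browser reachable from System > Nodes; the credentials are placeholders):

curl -u admin:yourpassword -H 'Accept: application/json' "http://127.0.0.1:9000/api/system/indexer/cluster/health?pretty=true"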

There is nothing in my Elasticsearch logs.

Here are my settings in server.conf (everything else is default):

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = redacted
root_password_sha2 = redacted
root_email = "admin@company.com"
root_timezone = America/Toronto
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 0.0.0.0:9000
trusted_proxies = 127.0.0.1/32, 0:0:0:0:0:0:0:1/128
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://localhost/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = mail.company.com
transport_email_port = 25
transport_email_use_auth = false
transport_email_subject_prefix = [Graylog]
transport_email_from_email = graylog@servers.company.com
proxied_requests_thread_pool_size = 32
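Note that there is no elasticsearch_hosts entry above, so Graylog should be falling back to its default of http://127.0.0.1:9200. A quick grep makes sure nothing overrides it (path assumes the standard package install):

grep -E '^elasticsearch_hosts' /etc/graylog/server/server.conf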

Everything in the Elasticsearch config is the default except for this:

cluster.name: graylog
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
action.auto_create_index: false

Netstat:

netstat -tunapl | grep 9200
tcp6       0      0 127.0.0.1:9200          :::*                    LISTEN      10342/java
tcp6       0      0 127.0.0.1:51814         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51814         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51808         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:51806         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:51812         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51806         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:51808         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51816         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51810         ESTABLISHED 10342/java
tcp6       0      0 127.0.0.1:51816         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:51810         127.0.0.1:9200          ESTABLISHED 9919/java
tcp6       0      0 127.0.0.1:9200          127.0.0.1:51812         ESTABLISHED 10342/java

/etc/hosts

127.0.1.1 dev-graylog-1n1 dev-graylog-1n1
127.0.0.1 localhost

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts

Is there anything else I can check to help me debug? Graylog is behind an external nginx reverse proxy, but the web UI works fine. Could that be related?

I'm using Ubuntu 18.04 and installed Graylog using the official doc here: https://docs.graylog.org/en/3.3/pages/installation/os/ubuntu.html

Any ideas?
Thanks!

How many indices and shards does your Elasticsearch cluster have?

I'm using the default Elasticsearch config. Here's some output:

curl -X GET "http://localhost:9200/_cluster/health?pretty=true"
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 12,
  "active_shards" : 12,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

curl -X GET "http://localhost:9200/_cat/indices"
green open gl-events_0        hvpLLI6FRsKfIIB9FqK4VA 4 0     16 0 111.9kb 111.9kb
green open gl-system-events_0 kmGWAKuxQoWdw2x2g9G4Og 4 0      0 0     1kb     1kb
green open graylog_0          VP_DjYn2QxOT8YGnitoaFA 4 0 106960 0  43.9mb  43.9mb
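To count shards directly instead of adding up the columns, the stock 6.x _cat endpoint prints one line per shard:

curl -s "http://localhost:9200/_cat/shards" | wc -l

With the three indices above (4 primaries each, 0 replicas) that should print 12, matching active_shards in the cluster health output.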

So I'm assuming 3 indices with 12 shards total?
The server is currently not in production, and only 2 servers send their logs to this instance.
Specs are 8 CPUs with 8 GB of RAM.
Thanks

Hey @camaer

What I have seen in the first post indicates that something is wrong, but it is nothing generic, so you will need to play Sherlock yourself to find the reason. I currently have no idea why this might be happening to you.

Is there any other place I should look besides the Graylog and Elasticsearch logs? I've already searched around but to no avail, so this post was sort of my last resort :frowning:

After some more investigation, it doesn't seem to be related to the connection to Elasticsearch. Maybe the error I got happened while I was messing around with the config and restarting Elasticsearch.
That being said, when I do a process-buffer dump I can see some messages that seem stuck. Is there a way to delete those messages, or maybe get more info about why they are stuck? This could explain the high CPU usage.

Thanks

You have the option, per node, to get a process-buffer dump or a thread dump. Both can help identify problems.
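If you prefer the CLI, the same dumps should also be available through the REST API, roughly like this on 3.x (paths from memory, so double-check them in the API browser; node ID and credentials are placeholders):

curl -u admin:yourpassword -H 'Accept: application/json' "http://127.0.0.1:9000/api/cluster/<node-id>/processbufferdump?pretty=true"
curl -u admin:yourpassword -H 'Accept: application/json' "http://127.0.0.1:9000/api/cluster/<node-id>/threaddump?pretty=true"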


The node details page might give you some additional insights into the buffers and the current state.

I can clearly see that some messages are holding up the queue. Here's the process-buffer dump:

{
  "ProcessBufferProcessor #0": "source: webapp-1n1 | message: www: ~~utc:=[1595864127]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b5f0e953-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:27.258-04:00 }",
  "ProcessBufferProcessor #1": "source: webapp-1n1 | message: www: ~~utc:=[1595864131]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b64c4fc0-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:31.496-04:00 }",
  "ProcessBufferProcessor #2": "source: webapp-1n1 | message: www: ~~utc:=[1595864135]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b67beb43-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:35.766-04:00 }",
  "ProcessBufferProcessor #3": "source: webapp-1n1 | message: www: ~~utc:=[1595864144]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b71210c1-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:44.324-04:00 }",
  "ProcessBufferProcessor #4": "source: webapp-1n1 | message: www: ~~utc:=[1595864140]~~type:=[INFO]~~dbname:=[db_cli_kdc]~~dbhost:=[db-1.domain.com]~~appcode:=[myapp]~~appMode:=[admin]~~appBuildNumber:=[44021]~~username:=[admin]~~remoteaddr:=[10.0.7.11]~~webserveraddr:=[192.168.1.154]~~phpsessionid:=[0emrvmo76vtq9e7i5i359dqe2m]~~scriptname:=[/webapp/prod/appsvc.php]~~printdocjob:=[polling]~~printdocid:=[2b211d6e-7b7d-4c81-ad5a-29b29f34e151]~~printdocprogress:=[6]~~file:=[/var/sources/sources/commonlib/printing/AbstractPDFReactorSetup.php]~~function:=[]~~line:=[]~~flags:=[] { application_name: ool | level: 5 | gl2_remote_ip: 192.168.1.154 | gl2_remote_port: 57028 | gl2_source_node: 064dafaa-7d54-4690-a279-1c20a54aa2ee | _id: b6a93cd0-d035-11ea-bd6a-7243d60ac5d6 | gl2_source_input: 5f17396be742804092ad5a34 | facility: user-level | timestamp: 2020-07-27T11:35:40.112-04:00 }"
}

and the thread dump:

I'm not really familiar with Graylog, so I don't really know how to debug this. How can I figure out what in those messages causes the hang? Is there a way to remove the problematic messages from the queue?

Thanks!

Any ideas? I'm currently at 3.5M messages in the journal :confused:
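The journal state can also be polled from the REST API, roughly like this (the /api/system/journal path is what recent versions seem to expose; credentials are placeholders):

curl -u admin:yourpassword -H 'Accept: application/json' "http://127.0.0.1:9000/api/system/journal?pretty=true"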

Thanks!

Bump. Still looking for some information on how I can debug a stuck message.
