Struggling with Graylog no longer exporting to Elasticsearch

Hello everyone,
I am fairly new to Graylog and have already spent a full day debugging without figuring out why my Graylog no longer processes messages.
I started off with a single-server environment (Graylog + Elasticsearch on one server) and ran into the issue that nothing was stored in Elasticsearch anymore.
I figured it might be a performance problem (although that does not quite make sense to me: with a performance issue I would expect a growing backlog, not a complete stop where nothing is saved at all).
So I moved Elasticsearch to a separate server.


The server is, unsurprisingly, throttled, as the journal (which I extended to 20GB in the hope of finding a fix before it ran full) is fully utilized.
Elasticsearch shows green with no indexer failures.

Both servers run Ubuntu 20.04 LTS.

  • Graylog 4.1.1+27dec96
    • VM with 12 Cores 12GB RAM
    • has >20GB free space
    • JVM Heap is set to 4GB
    • proxied via Apache2 (config below)
    • only HTTP/HTTPS + GL Input-Ports exposed via firewall
  • Elasticsearch 7.10.2
    • has 600GB free space
    • VM with 8 Cores 8GB RAM
    • Only port 9200+9300 exposed via firewall to GL-server

Some information about my current setup (comments, default values, etc. removed; hostnames changed):
/etc/graylog/server/server.conf

is_master = true

http_bind_address = 0.0.0.0:9000
http_external_uri = https://graylog.monitoring.domain.local/
elasticsearch_hosts = http://es-server:9200

elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true

message_journal_max_age = 48h
message_journal_max_size = 20gb

/etc/apache2/sites-enabled/000-default.conf

<VirtualHost *:80>
  ServerName graylog.monitoring.domain.local
  ServerAlias graylog.monitoring.domain.local

  Redirect permanent / https://graylog.monitoring.domain.local/
</VirtualHost>


<VirtualHost *:443>
    ServerName graylog.monitoring.domain.local
    ProxyRequests Off
    SSLEngine on
    SSLOptions +StrictRequire
    SSLCertificateFile "/etc/ssl/certs/graylog.monitoring.domain.local.crt"
    SSLCertificateKeyFile "/etc/ssl/private/graylog.monitoring.domain.local.key"

    <Proxy *>
        Order deny,allow
        Allow from all
    </Proxy>

    <Location />
        RequestHeader set X-Graylog-Server-URL "https://graylog.monitoring.domain.local"
        ProxyPass http://127.0.0.1:9000/
        ProxyPassReverse http://127.0.0.1:9000/
    </Location>

</VirtualHost>

/etc/elasticsearch/elasticsearch.yml

cluster.name: graylog
path.data: /graylog-data/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 0.0.0.0
discovery.type: single-node

curl http://es-server:9200/_cluster/health?pretty (executed from GL-server)

{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 20,
  "active_shards" : 20,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

GL /api/cluster

{
  "327f2801-d3ee-4215-bfe3-36016b482b56": {
    "facility": "graylog-server",
    "codename": "Noir",
    "node_id": "327f2801-d3ee-4215-bfe3-36016b482b56",
    "cluster_id": "9751f3c2-24bb-4523-9eb9-61c8902c7aec",
    "version": "4.1.1+27dec96",
    "started_at": "2021-07-08T20:18:16.969Z",
    "hostname": "gl-server",
    "lifecycle": "throttled",
    "lb_status": "throttled",
    "timezone": "Europe/Berlin",
    "operating_system": "Linux 5.4.0-77-generic",
    "is_processing": true
  }
}

Where could/should I look to figure out why nothing is written out?
Nothing is processed even when all inputs are stopped and the system has no load at all,
so I assume it is not a load problem.

Thanks for any help

Have you done a process buffer dump on the node to see if there’s a bad regex that could potentially be causing an issue (System–>Nodes–>More Actions)?
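In case the UI route is awkward, the same dump can also be pulled via the REST API. A minimal sketch, assuming an admin user and the node listening on 127.0.0.1:9000 (the endpoint path is the one exposed by the Graylog 4.x API browser; verify it there for your version):

```shell
# Hedged sketch: fetch the process-buffer dump over the REST API.
# Assumptions: admin user, Graylog API reachable on 127.0.0.1:9000.
curl -s -u admin \
  -H 'Accept: application/json' \
  'http://127.0.0.1:9000/api/cluster/processbufferdump?pretty=true'
```

If a single message (e.g. one triggering a catastrophic regex) is stuck, it will typically show up in every dump on the same processor slot.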

@chas0rde

If I can add some questions: what does your whole Graylog server configuration look like?
I see you have 16+ million messages in your journal, and I was wondering why your Java heap is at 4GB?

I think that might not be correct. Overall I believe you have some configuration issues in your setup; I can't tell for sure unless I see your whole configuration.

@gsmith
The /etc/graylog/server/server.conf above contains everything that is non-standard.
What else might you need?
Regarding the heap, I followed the documentation's recommendation to set it to half the RAM size (the machine used to have 8GB of RAM, now extended to 12GB).

/etc/default/graylog-server

# Path to the java executable.
JAVA=/usr/bin/java

# Default Java options for heap and garbage collection.
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"

# Avoid endless loop with some TLSv1.3 implementations.
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Djdk.tls.acknowledgeCloseNotify=true"

# Pass some extra args to graylog-server. (i.e. "-d" to enable debug mode)
GRAYLOG_SERVER_ARGS=""

# Program that will be used to wrap the graylog-server command. Useful to
# support programs like authbind.
GRAYLOG_COMMAND_WRAPPER=""

Best regards

@aaronsachs
The dump contains 5 entries with a lot of content. Should I see any specific error there?
From what I can see, it is mostly data from our VMware vCenter, and I do not see an error.

EDIT
When I pull a thread dump, a lot of the output looks like this:

"main" id=1 state=WAITING
"Reference Handler" id=2 state=WAITING
"Finalizer" id=3 state=WAITING
"Signal Dispatcher" id=5 state=RUNNABLE
"cluster-ClusterId{value='60e75d8ad6cd330bcae611c8', description='null'}-localhost:27017" id=24 state=TIMED_WAITING
"CleanCursors-1-thread-1" id=25 state=TIMED_WAITING
"inputbufferprocessor-0" id=26 state=WAITING
"inputbufferprocessor-1" id=27 state=WAITING
"inputbufferprocessor-2" id=28 state=WAITING
"OkHttp ConnectionPool" id=29 state=TIMED_WAITING
Locked synchronizers: count = 1
  
"Okio Watchdog" id=30 state=TIMED_WAITING
"pool-10-thread-1" id=31 state=RUNNABLE (running in native)
"I/O dispatcher 1" id=32 state=RUNNABLE (running in native)
"I/O dispatcher 2" id=33 state=RUNNABLE (running in native)
"I/O dispatcher 3" id=34 state=RUNNABLE (running in native)
"I/O dispatcher 4" id=35 state=RUNNABLE (running in native)
"I/O dispatcher 5" id=36 state=RUNNABLE (running in native)
"I/O dispatcher 6" id=37 state=RUNNABLE
"I/O dispatcher 7" id=38 state=RUNNABLE
"I/O dispatcher 8" id=39 state=RUNNABLE (running in native)
"I/O dispatcher 9" id=40 state=RUNNABLE
"I/O dispatcher 10" id=41 state=RUNNABLE
"I/O dispatcher 11" id=42 state=RUNNABLE
"I/O dispatcher 12" id=43 state=RUNNABLE
"I/O dispatcher 13" id=44 state=RUNNABLE (running in native)
"I/O dispatcher 14" id=45 state=RUNNABLE (running in native)
"I/O dispatcher 15" id=46 state=RUNNABLE
"I/O dispatcher 16" id=47 state=RUNNABLE (running in native)
"I/O dispatcher 17" id=48 state=RUNNABLE
"I/O dispatcher 18" id=49 state=RUNNABLE (running in native)
"I/O dispatcher 19" id=50 state=RUNNABLE (running in native)
"I/O dispatcher 20" id=51 state=RUNNABLE
"scheduled-daemon-0" id=52 state=WAITING
"aws-instance-lookup-refresher-0" id=53 state=TIMED_WAITING
"outputbufferprocessor-0" id=54 state=WAITING
"outputbufferprocessor-1" id=55 state=WAITING
"outputbufferprocessor-2" id=56 state=WAITING
"aws-instance-lookup-refresher-0" id=57 state=TIMED_WAITING
"aws-instance-lookup-refresher-0" id=58 state=TIMED_WAITING
"aws-instance-lookup-refresher-0" id=59 state=TIMED_WAITING
"aws-instance-lookup-refresher-0" id=60 state=TIMED_WAITING
"processbufferprocessor-0" id=61 state=RUNNABLE
"processbufferprocessor-1" id=62 state=RUNNABLE
"processbufferprocessor-2" id=63 state=RUNNABLE
"processbufferprocessor-3" id=64 state=RUNNABLE
"processbufferprocessor-4" id=65 state=RUNNABLE
"scheduled-daemon-1" id=66 state=WAITING
"scheduled-daemon-2" id=67 state=WAITING
"eventbus-handler-0" id=68 state=WAITING
"InputSetupService" id=70 state=WAITING
"LocalKafkaMessageQueueReader" id=75 state=TIMED_WAITING
"scheduled-daemon-3" id=77 state=WAITING
"scheduled-0" id=79 state=WAITING
"scheduled-1" id=81 state=WAITING
"scheduled-2" id=83 state=WAITING
"scheduled-3" id=85 state=WAITING
"output-setup-service-0" id=87 state=WAITING
"scheduled-4" id=92 state=WAITING
"JobSchedulerService" id=93 state=TIMED_WAITING
"scheduled-daemon-4" id=95 state=WAITING
"scheduled-daemon-5" id=98 state=TIMED_WAITING
"scheduled-5" id=99 state=TIMED_WAITING
"scheduled-6" id=100 state=WAITING
"scheduled-daemon-6" id=101 state=WAITING
"scheduled-daemon-7" id=102 state=WAITING
"scheduled-daemon-8" id=103 state=WAITING
"scheduled-daemon-9" id=104 state=WAITING
"scheduled-daemon-10" id=105 state=WAITING
"scheduled-daemon-11" id=106 state=WAITING
"scheduled-daemon-12" id=107 state=WAITING
"scheduled-daemon-13" id=109 state=WAITING
"scheduled-daemon-14" id=111 state=WAITING
"scheduled-daemon-15" id=112 state=WAITING
"scheduled-daemon-16" id=113 state=WAITING
"scheduled-daemon-17" id=114 state=WAITING
"periodical-org.graylog2.periodical.IndexFailuresPeriodical-0" id=116 state=WAITING
"scheduled-daemon-18" id=117 state=WAITING
"scheduled-daemon-19" id=118 state=WAITING
"scheduled-daemon-20" id=119 state=WAITING
"scheduled-daemon-21" id=120 state=WAITING
"scheduled-daemon-23" id=123 state=WAITING
"scheduled-daemon-22" id=122 state=WAITING
"scheduled-daemon-24" id=124 state=WAITING
"scheduled-daemon-25" id=125 state=WAITING
"scheduled-daemon-26" id=127 state=WAITING
"scheduled-daemon-27" id=128 state=WAITING
"scheduled-daemon-28" id=129 state=WAITING
"scheduled-daemon-29" id=130 state=WAITING
"scheduled-7" id=131 state=TIMED_WAITING
"scheduled-8" id=134 state=WAITING
"scheduled-9" id=138 state=WAITING
"cluster-eventbus-handler-0" id=139 state=WAITING
"cluster-eventbus-handler-1" id=140 state=WAITING
"scheduled-10" id=141 state=WAITING
"scheduled-11" id=142 state=TIMED_WAITING
"eventbus-handler-1" id=143 state=WAITING
"scheduled-12" id=144 state=WAITING
"scheduled-13" id=145 state=WAITING
"scheduled-14" id=146 state=WAITING
"scheduled-15" id=147 state=WAITING
"HttpServer-0" id=148 state=TIMED_WAITING
Locked synchronizers: count = 1
  
"grizzly-nio-kernel(1) SelectorRunner" id=149 state=RUNNABLE (running in native)
"netty-transport-0" id=150 state=RUNNABLE
Locked synchronizers: count = 1
  
"netty-transport-1" id=151 state=RUNNABLE
Locked synchronizers: count = 1
  
"netty-transport-2" id=152 state=RUNNABLE
Locked synchronizers: count = 1
  
"netty-transport-3" id=156 state=RUNNABLE (running in native)
Locked synchronizers: count = 1
  
"netty-transport-4" id=157 state=RUNNABLE (running in native)
Locked synchronizers: count = 1
  
"netty-transport-0" id=158 state=RUNNABLE
Locked synchronizers: count = 1
  
"scheduled-16" id=159 state=WAITING
"scheduled-17" id=160 state=WAITING
"scheduled-18" id=161 state=WAITING
"scheduled-19" id=162 state=WAITING
"http-worker-0" id=163 state=WAITING
"http-worker-1" id=164 state=WAITING
"SessionValidationThread-1" id=165 state=TIMED_WAITING
"proxied-requests-pool-0" id=166 state=WAITING
"http-worker-2" id=167 state=WAITING
"http-worker-3" id=168 state=WAITING
"proxied-requests-pool-1" id=169 state=WAITING
"http-worker-4" id=170 state=WAITING
"http-worker-5" id=171 state=WAITING
"scheduled-20" id=172 state=WAITING
"scheduled-21" id=173 state=WAITING
"http-worker-6" id=174 state=WAITING
"http-worker-7" id=175 state=RUNNABLE
Locked synchronizers: count = 1
  
"proxied-requests-pool-2" id=176 state=WAITING
"http-worker-8" id=177 state=WAITING
"http-worker-9" id=178 state=WAITING
"proxied-requests-pool-3" id=179 state=WAITING
"scheduled-22" id=180 state=WAITING
"scheduled-23" id=181 state=WAITING
"http-worker-10" id=182 state=WAITING
"http-worker-11" id=183 state=WAITING
"http-worker-12" id=184 state=WAITING
"http-worker-13" id=185 state=WAITING
"proxied-requests-pool-4" id=186 state=WAITING
"http-worker-14" id=187 state=WAITING
"http-worker-15" id=188 state=RUNNABLE (running in native)
Locked synchronizers: count = 1
  
"proxied-requests-pool-5" id=189 state=WAITING
"scheduled-24" id=190 state=WAITING
"scheduled-25" id=191 state=WAITING
"netty-transport-0" id=192 state=RUNNABLE
Locked synchronizers: count = 1
  
"netty-transport-1" id=193 state=RUNNABLE (running in native)
Locked synchronizers: count = 1
  
"proxied-requests-pool-6" id=194 state=WAITING
"scheduled-26" id=195 state=WAITING
"scheduled-27" id=196 state=WAITING
"proxied-requests-pool-7" id=197 state=WAITING
"scheduled-28" id=198 state=WAITING
"scheduled-29" id=199 state=WAITING
"proxied-requests-pool-8" id=200 state=WAITING
"proxied-requests-pool-9" id=201 state=WAITING
"proxied-requests-pool-10" id=202 state=WAITING
"proxied-requests-pool-11" id=203 state=WAITING
"proxied-requests-pool-12" id=204 state=WAITING
"proxied-requests-pool-13" id=205 state=WAITING
"proxied-requests-pool-14" id=206 state=WAITING
"proxied-requests-pool-15" id=207 state=WAITING
"proxied-requests-pool-16" id=208 state=WAITING
"proxied-requests-pool-17" id=209 state=WAITING
"proxied-requests-pool-18" id=210 state=WAITING
"proxied-requests-pool-19" id=211 state=WAITING
"proxied-requests-pool-20" id=212 state=WAITING
"proxied-requests-pool-21" id=213 state=WAITING
"proxied-requests-pool-22" id=214 state=WAITING
"kafka-journal-scheduler-0" id=215 state=WAITING
"metrics-meter-tick-thread-1" id=216 state=TIMED_WAITING
"proxied-requests-pool-23" id=217 state=WAITING
"proxied-requests-pool-24" id=218 state=WAITING
"proxied-requests-pool-25" id=219 state=WAITING
"proxied-requests-pool-26" id=220 state=WAITING
"proxied-requests-pool-27" id=221 state=WAITING
"proxied-requests-pool-28" id=222 state=WAITING
"netty-transport-0" id=223 state=RUNNABLE (running in native)
Locked synchronizers: count = 1
  
"proxied-requests-pool-29" id=224 state=WAITING
"metrics-meter-tick-thread-2" id=225 state=WAITING
"proxied-requests-pool-30" id=226 state=WAITING
"proxied-requests-pool-31" id=227 state=WAITING
"kafka-journal-scheduler-1" id=228 state=WAITING
"query-engine-0" id=229 state=WAITING
"outputbuffer-processor-executor-0" id=264 state=WAITING
"query-engine-1" id=319 state=WAITING
"query-engine-2" id=320 state=WAITING
"query-engine-3" id=321 state=WAITING
"outputbuffer-processor-executor-0" id=323 state=WAITING
"outputbuffer-processor-executor-0" id=325 state=WAITING
"outputbuffer-processor-executor-1" id=327 state=WAITING
"outputbuffer-processor-executor-1" id=689 state=WAITING
"outputbuffer-processor-executor-1" id=2405 state=WAITING
"outputbuffer-processor-executor-2" id=21813 state=WAITING
"outputbuffer-processor-executor-2" id=21814 state=WAITING
"outputbuffer-processor-executor-2" id=21815 state=WAITING
"netty-transport-1" id=49721 state=RUNNABLE (running in native)
Locked synchronizers: count = 1

I shortened the output.

The last line originally reads:


    Locked synchronizers: count = 1
      - java.util.concurrent.ThreadPoolExecutor$Worker@4b9c33f6

Hello,

To be honest, I've not seen just those settings in a configuration file; normally other settings are needed, especially if you're using HTTPS. For example, here is my Graylog configuration file. I'm running 12 CPUs and 10GB of memory on one virtual machine, configured to use HTTPS/TLS. You'll notice that I have increased the following, but I left one core for the system. NOTE: I'm ingesting 30GB of logs a day.

processbuffer_processors = 6
outputbuffer_processors = 3
inputbuffer_processors = 2
Graylog_config
[root@graylog elasticsearch]# grep -v "^#\|^$" /etc/graylog/server/server.conf
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret =long_string
root_password_sha2 =long_string
root_email = "greg.smith@domain.com"
root_timezone = America/Chicago
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 8.8.8.8:9000
http_publish_uri = https://graylog.domain.com:9000/
http_enable_cors = true
http_enable_tls = true
http_tls_cert_file = /etc/ssl/certs/graylog/graylog-certificate.pem
http_tls_key_file = /etc/ssl/certs/graylog/graylog-key.pem
http_tls_key_password = secret
elasticsearch_hosts = http://8.8.8.8:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = false
elasticsearch_analyzer = standard
elasticsearch_index_optimization_timeout = 1h
output_batch_size = 5000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 6
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 12gb
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://mongo_admin:password@localhost:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = localhost
transport_email_port = 25
transport_email_subject_prefix = [graylog]
transport_email_from_email = root@domain.com
transport_email_web_interface_url = https://8.8.8.8:9000
http_connect_timeout = 10s
proxied_requests_thread_pool_size = 32
[root@graylog elasticsearch]#

For troubleshooting, have you tried running Graylog without Nginx in front of it?
Correct me if I'm wrong, but didn't you state that you have two Elasticsearch servers? If so, you may not need this config.

Here is my ES.yml file which look almost the same.

ES_Config
[root@graylog elasticsearch]# grep -v "^#\|^$" /etc/elasticsearch/elasticsearch.yml
cluster.name: graylog
path.data: /var/lib/elasticsearch
path.logs: /var/log/elasticsearch
network.host: 8.8.8.8
http.port: 9200
action.auto_create_index: false
discovery.type: single-node

Edit: The reason I suspected a configuration error is that your process buffer is at 100%. Correct me if I'm wrong, but you did not see any errors in the Graylog log file that might point to a connection issue with Elasticsearch? I have found that it is normally either a missing setting in the Graylog configuration file or a lack of resources, but it seems you have the resources available.


What's the ES heap set to? If you haven't explicitly set it, it will take the default of 1GB, which may not be enough.
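To verify what the ES node is actually running with, something like this should show the configured maximum heap (a sketch; es-server is the hostname from the original post):

```shell
# Show current and maximum JVM heap per node. On ES 7.x, heap.max
# reports 1gb if nothing was set explicitly in jvm.options.
curl -s 'http://es-server:9200/_cat/nodes?v&h=name,heap.current,heap.max'
```

If it does report 1gb, raising it via a file in /etc/elasticsearch/jvm.options.d/ (e.g. -Xms4g and -Xmx4g, keeping Xms equal to Xmx and at most half of the VM's RAM) would be the usual fix.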

@gsmith
I did not change the processor settings (AFAIK). They are currently set as follows:

processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2

There is only one ES node running (it was on the same host before, but I migrated it to a separate machine a while ago), so I guess discovery.type: single-node should be OK.

Here is my full Graylog config (all non-comment lines):

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = long_string
root_password_sha2 = long_string
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 0.0.0.0:9000
http_external_uri = https://graylog.some.tld/
elasticsearch_hosts = http://es-host:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 48h
message_journal_max_size = 20gb
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://localhost/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
proxied_requests_thread_pool_size = 32

My ES config looks much the same, yes.

As far as I can see, there are no errors in the log regarding connectivity issues.
The ES API is reachable from Graylog as well and shows a green state.
Yes, the process buffer is at 100%; the input and output buffers are empty.
No messages are being output… I really have no idea why :frowning:
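For the record, the buffer utilization can also be checked from the CLI, which is handy when comparing it with the journal metrics over time. A sketch, assuming the API on 127.0.0.1:9000 and an admin user:

```shell
# Returns utilization of the input, process, and output buffers for this
# node (the same numbers the web UI shows under System -> Nodes).
curl -s -u admin 'http://127.0.0.1:9000/api/system/buffers?pretty=true'
```

A process buffer pinned at 100% while the output buffer stays empty points at the processing stage (extractors/pipelines) rather than at the Elasticsearch output.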

@aaronsachs
Heap is 4G

-Xms4g
-Xmx4g

Hello,

I just want to sum up your environment; correct me if I'm wrong.
Elasticsearch was moved to a different server, so it's separated from Graylog/MongoDB?
Did you create a new Elasticsearch, or did you execute a dump & restore?
Nginx is your load balancer and is configured to redirect HTTP → HTTPS?
Graylog is configured to use http_external_uri with HTTPS?

This is unusual. I've seen the buffers full before, and that was resolved by placing the correct configuration in the Graylog config file; otherwise it was a lack of resources or a permission issue.
By chance, when all your services are running (GL, MongoDB, and ES), do you see the journal utilization going down at all? I had over 10GB of logs in my journal, and it took my server a little over an hour to process. Just a thought.
Have you tried increasing processbuffer_processors and restarting the Graylog service?
Do you see any permission issues?
Is there a firewall enabled? If so, did you check that Elasticsearch allows port 9200 through the firewall? Do you have SELinux/AppArmor enabled/installed?
I can't really think of anything else. Someone on the forum had a similar problem, and it turned out to be their Nginx causing it.
Sorry I can't be more help.

Elasticsearch was moved to a different server, so it's separated from Graylog/MongoDB?

Yes

Did you create a new Elasticsearch, or did you execute a dump & restore?

New ES. Data was not migrated

Nginx is your load balancer and is configured to redirect HTTP → HTTPS?

Apache2 is my load balancer. It does the HTTP → HTTPS redirect
and serves as the proxy from HTTPS to port 9000.

Graylog is configured to use http_external_uri with HTTPS?

Yes

By chance, when all your services are running (GL, MongoDB, and ES), do you see the journal utilization going down at all?

I tried disabling all inputs to see if GL would then process the journal, but nothing.

Have you tried increasing processbuffer_processors and restarting the Graylog service?

Not yet

Do you see any permission issues?

Nothing visible in the logs, as far as I can see.

Is there a firewall enabled? Do you have Selinux/AppArmor enable/installed?

Yes, UFW is running; the ports are configured between the systems.
No SELinux/AppArmor.

Strangely, today I looked at GL and it was processing. I haven't had time to do anything on it for the past two days, so no changes from my end. No idea why it's working now :frowning:
I will check whether it stays that way.

Regarding the initial idea with the process buffer dump:
Is it possible that malformed packets (e.g. from misbehaving firewalls) cause processing to come to a grinding halt, assuming all extractors are valid regexes?
Or shouldn't that happen? And if so:

  • Is there a clear warning somewhere telling me about it?
  • Is it fixable (e.g. by somehow kicking messages out of the processing chain)?
  • Should inputs be separated by source system somehow (if possible)?

This still smacks of a bad regex in a pipeline to me. Did you do a processbuffer dump? I saw the thread dump, but I may have missed the processbuffer one.

The other thing you could do is enable debug metrics in your pipeline, then take the data generated there and see if there's some pipeline rule that's running for an inordinately long time. If that ends up being the case, then I'd say that's the cause, and try to figure out what the rule is doing.
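If it helps, the rule timers can also be read via the metrics API instead of the UI. A sketch, where the namespace below is an assumption based on the pipeline processor plugin's package name (check System -> Metrics for the exact names your install exposes):

```shell
# Dump all metrics under the pipeline processor namespace, which should
# include per-rule execution timers once debug metrics are enabled.
curl -s -u admin \
  'http://127.0.0.1:9000/api/system/metrics/namespace/org.graylog.plugins.pipelineprocessor?pretty=true'
```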

But now, since I saw Indices in Graylog, I’m wondering what the rest of the cluster looks like. I’d be keen to know the output of:

curl -X GET "localhost:9200/_cluster/health?pretty" from Elasticsearch

And

curl -XGET -u admin "http://localhost:9000/api/system/indices/index_sets" from Graylog.


This still smacks of a bad regex in a pipeline to me. Did you do a processbuffer dump? I saw the thread dump, but I may have missed the processbuffer one.

I did, but it was too long to post. I did not store it anywhere, though, so there is no way of getting it back unless the problem reappears.

The other thing you could do is enable debug metrics in your pipeline and then take the data generated there and see if there’s some pipeline rule that’s running for an inordinately long amount of time. If that ends up being the case, then I’d say that’s the cause and try and figure out what the rule is doing.

I will try that if it happens again. I have disabled some input systems for now (a firewall and some other systems were also shipping to GL before; maybe something in there killed the buffer).
Currently I am only ingesting network logs and Active Directory, with no issues at the moment.

{
  "total": 4,
  "index_sets": [
    {
      "id": "60c333d48b01395b7ad125ed",
      "title": "Default index set",
      "description": "The Graylog default index set",
      "index_prefix": "graylog",
      "shards": 4,
      "replicas": 0,
      "rotation_strategy_class": "org.graylog2.indexer.rotation.strategies.TimeBasedRotationStrategy",
      "rotation_strategy": {
        "type": "org.graylog2.indexer.rotation.strategies.TimeBasedRotationStrategyConfig",
        "rotation_period": "P1M"
      },
      "retention_strategy_class": "org.graylog2.indexer.retention.strategies.DeletionRetentionStrategy",
      "retention_strategy": {
        "type": "org.graylog2.indexer.retention.strategies.DeletionRetentionStrategyConfig",
        "max_number_of_indices": 20
      },
      "creation_date": "2021-06-11T09:58:44.813Z",
      "index_analyzer": "standard",
      "index_optimization_max_num_segments": 1,
      "index_optimization_disabled": false,
      "field_type_refresh_interval": 5000,
      "index_template_type": null,
      "writable": true,
      "default": true
    },
    {
      "id": "60c333da8b01395b7ad126b7",
      "title": "Graylog Events",
      "description": "Stores Graylog events.",
      "index_prefix": "gl-events",
      "shards": 4,
      "replicas": 0,
      "rotation_strategy_class": "org.graylog2.indexer.rotation.strategies.TimeBasedRotationStrategy",
      "rotation_strategy": {
        "type": "org.graylog2.indexer.rotation.strategies.TimeBasedRotationStrategyConfig",
        "rotation_period": "P1M"
      },
      "retention_strategy_class": "org.graylog2.indexer.retention.strategies.DeletionRetentionStrategy",
      "retention_strategy": {
        "type": "org.graylog2.indexer.retention.strategies.DeletionRetentionStrategyConfig",
        "max_number_of_indices": 12
      },
      "creation_date": "2021-06-11T09:58:50.907Z",
      "index_analyzer": "standard",
      "index_optimization_max_num_segments": 1,
      "index_optimization_disabled": false,
      "field_type_refresh_interval": 60000,
      "index_template_type": "events",
      "writable": true,
      "default": false
    },
    {
      "id": "60c333da8b01395b7ad126b9",
      "title": "Graylog System Events",
      "description": "Stores Graylog system events.",
      "index_prefix": "gl-system-events",
      "shards": 4,
      "replicas": 0,
      "rotation_strategy_class": "org.graylog2.indexer.rotation.strategies.TimeBasedRotationStrategy",
      "rotation_strategy": {
        "type": "org.graylog2.indexer.rotation.strategies.TimeBasedRotationStrategyConfig",
        "rotation_period": "P1M"
      },
      "retention_strategy_class": "org.graylog2.indexer.retention.strategies.DeletionRetentionStrategy",
      "retention_strategy": {
        "type": "org.graylog2.indexer.retention.strategies.DeletionRetentionStrategyConfig",
        "max_number_of_indices": 12
      },
      "creation_date": "2021-06-11T09:58:50.937Z",
      "index_analyzer": "standard",
      "index_optimization_max_num_segments": 1,
      "index_optimization_disabled": false,
      "field_type_refresh_interval": 60000,
      "index_template_type": "events",
      "writable": true,
      "default": false
    },
    {
      "id": "60c333d58b01395b7ad125fb",
      "title": "Restored Archives",
      "description": "Indices which have been restored from an archive.",
      "index_prefix": "restored-archive",
      "shards": 4,
      "replicas": 0,
      "rotation_strategy_class": "org.graylog2.indexer.rotation.strategies.MessageCountRotationStrategy",
      "rotation_strategy": {
        "type": "org.graylog2.indexer.rotation.strategies.MessageCountRotationStrategyConfig",
        "max_docs_per_index": 2147483647
      },
      "retention_strategy_class": "org.graylog2.indexer.retention.strategies.NoopRetentionStrategy",
      "retention_strategy": {
        "type": "org.graylog2.indexer.retention.strategies.NoopRetentionStrategyConfig",
        "max_number_of_indices": 2147483647
      },
      "creation_date": "2021-06-11T09:58:39.941Z",
      "index_analyzer": "standard",
      "index_optimization_max_num_segments": 1,
      "index_optimization_disabled": false,
      "field_type_refresh_interval": 5000,
      "index_template_type": null,
      "writable": false,
      "default": false
    }
  ],
  "stats": {}
}