No more messages flowing inbound? Started over twice now... what am I doing wrong?


1. Describe your incident:

Installed Ubuntu 20.04 and followed the guide to install Graylog - set up three DCs and our Meraki gear to forward syslog traffic to port 1515. Also tinkered with nxlog forwarding to a Beats input on port 5044.

Everything works well for a day or two and then messages stop flowing inbound.

2. Describe your environment:

  • OS Information:
    Ubuntu 20.04 on a Hyper-V VM
    8 cores (Xeon Gold 6148 CPU)
    24 GB memory

  • Package Version:
    ii elasticsearch-oss 7.10.2 amd64 Distributed RESTful search engine built for the cloud
    ii graylog-4.2-repository 1-4 all Package to install Graylog 4.2 GPG key and repository
    ii graylog-integrations-plugins 4.2.5-1 all Graylog Integrations plugins
    ii graylog-server 4.2.5-1 all Graylog server
    ii mongodb-org 4.0.28 amd64 MongoDB open source document-oriented database system (metapackage)
    ii mongodb-org-mongos 4.0.28 amd64 MongoDB sharded cluster query router
    ii mongodb-org-server 4.0.28 amd64 MongoDB database server
    ii mongodb-org-shell 4.0.28 amd64 MongoDB shell client
    ii mongodb-org-tools 4.0.28 amd64 MongoDB tools

  • Service logs, configurations, and environment variables:
    server.conf file:
    is_master = true
    node_id_file = /etc/graylog/server/node-id
    password_secret =
    root_password_sha2 =
    bin_dir = /usr/share/graylog-server/bin
    data_dir = /var/lib/graylog-server
    plugin_dir = /usr/share/graylog-server/plugin
    http_bind_address = 10.10.10.27:9000
    http_enable_cors = false
    rotation_strategy = count
    elasticsearch_max_docs_per_index = 20000000
    elasticsearch_max_number_of_indices = 20
    retention_strategy = delete
    elasticsearch_shards = 4
    elasticsearch_replicas = 0
    elasticsearch_index_prefix = graylog
    allow_leading_wildcard_searches = false
    allow_highlighting = false
    elasticsearch_analyzer = standard
    output_batch_size = 500
    output_flush_interval = 1
    output_fault_count_threshold = 5
    output_fault_penalty_seconds = 30
    processbuffer_processors = 5
    outputbuffer_processors = 3
    processor_wait_strategy = blocking
    ring_size = 65536
    inputbuffer_ring_size = 65536
    inputbuffer_processors = 2
    inputbuffer_wait_strategy = blocking
    message_journal_enabled = true
    message_journal_dir = /var/lib/graylog-server/journal
    lb_recognition_period_seconds = 3
    mongodb_uri = mongodb://localhost/graylog
    mongodb_max_connections = 1000
    mongodb_threads_allowed_to_block_multiplier = 5
    proxied_requests_thread_pool_size = 32

3. What steps have you already taken to try and solve the problem?

tail -f /var/log/graylog-server/server.log
2022-02-01T15:38:10.404-06:00 WARN [LocalKafkaJournal] Journal utilization (101.0%) has gone over 95%.
2022-02-01T15:38:41.843-06:00 INFO [connection] Opened connection [connectionId{localValue:17, serverValue:17}] to localhost:27017
2022-02-01T15:38:41.850-06:00 INFO [connection] Opened connection [connectionId{localValue:15, serverValue:15}] to localhost:27017
2022-02-01T15:38:41.851-06:00 INFO [connection] Opened connection [connectionId{localValue:16, serverValue:16}] to localhost:27017
2022-02-01T15:38:41.851-06:00 INFO [connection] Opened connection [connectionId{localValue:14, serverValue:12}] to localhost:27017
2022-02-01T15:38:41.852-06:00 INFO [connection] Opened connection [connectionId{localValue:13, serverValue:13}] to localhost:27017
2022-02-01T15:38:41.854-06:00 INFO [connection] Opened connection [connectionId{localValue:11, serverValue:11}] to localhost:27017
2022-02-01T15:38:41.857-06:00 INFO [connection] Opened connection [connectionId{localValue:12, serverValue:14}] to localhost:27017
2022-02-01T15:39:11.056-06:00 WARN [LocalKafkaJournal] Journal utilization (98.0%) has gone over 95%.
2022-02-01T15:39:11.058-06:00 INFO [LocalKafkaJournal] Journal usage is 98.00% (threshold 100%), changing load balancer status from THROTTLED to ALIVE

Journal usage overrun maybe?

curl -XGET http://localhost:9200/_cluster/health?pretty=true
{
  "cluster_name" : "graylog",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 20,
  "active_shards" : 20,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

I’m not sure what a shard is, but I’m assuming that if it’s at 100% it’s full or too busy to process more. What can I do to get messages flowing again? And most importantly, please help me understand why this happened so I can prevent it from recurring. (Knowing how to delete a folder of queued messages is useful, but why didn’t they process, and what settings do I need to adjust for proper retention and automatic digestion?)

4. How can the community help?

I’m guessing my Elasticsearch settings are wrong, or perhaps the server isn’t powerful enough? I cannot get into the /etc/elasticsearch directory as access is denied - is that expected? I don’t want to start modifying file permissions without understanding why, or what that might break.

How can I get messages flowing again?


Hello && Welcome

That warning means your journal is full. It could be a couple of different things:

  1. Check your Elasticsearch status (systemctl status elasticsearch) and the ES log files - see the commands sketched after this list.
  2. Perhaps increase your journal from the default of 5 GB to maybe 10 GB, IF you have enough room on your drive to do so.
  3. If your process buffers are shown as full in the Web UI, you may want to increase processbuffer_processors = 5 in your server.conf file, then wait. It can take time for the journal to drain, depending on how much resource you have and/or how many messages GL is ingesting.
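For number 1, roughly something like this (paths are the defaults for the Ubuntu/Debian packages - adjust if yours differ):

systemctl status elasticsearch
# the ES log file name usually follows the cluster name, which is "graylog" in your cluster health output
sudo tail -n 100 /var/log/elasticsearch/graylog.log
# Graylog's own log, which you are already tailing
sudo tail -f /var/log/graylog-server/server.log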

Your process buffer is the big hitter, then your output buffer. If the buffers are full, I would only increase the process buffer to 6 and wait to see if it goes down. Take note: when your GL volume is full, or your journal is full, or your buffers are at 100%, logs coming in will pause and you will be unable to see those messages until the issue is resolved. Normally, when the journal is full it means the resources are unable to keep up with the amount of messages coming in, or there is an Elasticsearch issue.
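Roughly, that change would look like this in /etc/graylog/server/server.conf (the value of 6 is just the suggestion above, not a magic number):

processbuffer_processors = 6    # was 5
outputbuffer_processors = 3     # leave as-is unless the output buffer is the one backing up
# then restart Graylog so the change takes effect
sudo systemctl restart graylog-server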

So the user you logged into the Graylog server with does not have permission to that directory. This seems like a permission issue as well. Check your server logs for Elasticsearch and Graylog.

EDIT: Since you’re using nxlog - I had a problem a while back where one device was sending over 12,000 messages. This was from an nxlog client and it filled my journal up really quickly over time, so you may want to check your shippers.
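A quick-and-dirty way to see which shipper is the noisiest is to sample the input ports on the Graylog box (this assumes tcpdump is installed, and uses the 1515/5044 ports from your post):

# sample ~2000 packets on the input ports and count them per source address
sudo tcpdump -nn -l -c 2000 'port 1515 or port 5044' | awk '{print $3}' | cut -d. -f1-4 | sort | uniq -c | sort -rn | head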

The user I am logged in with is the user I installed Graylog with, but as I am not able to see inside the /etc/elasticsearch/ folder, I’m guessing this might be an issue? Maybe once the journal fills up it cannot write the messages to Elasticsearch? Or am I misunderstanding the flow? Is it journal → Elasticsearch → Mongo?

The status is up and running, but I’m not sure how to access the logs since I can’t see inside the directory. Is it best to chmod the elasticsearch folder, or better to give my user root access, even if only temporarily?

Messages did skyrocket when I first opened the ports for ingestion (4,000 messages/second) but eventually slowed down to 150/second or so… No idea what Graylog is capable of, or whether I’m choking it with data.

Seeing as I’m using SSH to do everything - what are good ways to check the journal size, hard drive space, etc.? How do I increase the journal size? The server has either 300 or 600 GB of storage.

What happens to old logs? I’m assuming they get deleted by default per the server config - but is there a way to archive them onto an SMB share or similar for longevity?

Yes, it’s probably a combination of a few issues you may have.

If the user you installed Graylog with has root permission then you could adjust it, but if the user you installed Graylog with does not have permission to the /etc/elasticsearch directory, you may want to talk to the local admin about getting access.

The journal status is in the Web UI under System / Nodes.

root # df -h

message_journal_max_size = 5gb
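A rough sketch of what I mean (the journal path is taken from your server.conf, and 10gb is just an example - make sure df shows you have the headroom first):

# disk space and current journal size on disk
df -h
sudo du -sh /var/lib/graylog-server/journal
# add or change this in /etc/graylog/server/server.conf, e.g.
# message_journal_max_size = 10gb
# (the setting is not in your pasted config, so you are currently at the 5gb default)
# then restart so it takes effect
sudo systemctl restart graylog-server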

You have a lot going on. I would first find out WHY your journal is filling up over its capacity and resolve that issue.

Also…

If you don’t have access to the log files, then you have more problems.

I might have some insight regarding your (in)ability to access certain files and directories in Ubuntu. I’m not a Linux person, so it’s possible I’m violating best practice here. Please take it with a grain of salt.

I’ve noticed that on Ubuntu server (I’m assuming you installed Ubuntu 20.04 LTS), the ‘administrative’ user you make as part of the install isn’t actually a root user (which I think is on purpose?). You can ‘sudo’ to do administrative commands, and edit files that require root access (e.g. “sudo nano /etc/graylog/server/server.conf”), but you can’t actually browse those directories as the admin user. This is where I may be violating best practice, but if you do “sudo su” you’ll have access to a root command prompt, which will let you ‘cd’ and ‘ls’ any directory (including /etc/elasticsearch). I’m pretty sure that’s all operating as intended, and it may not actually be a problem if your normal admin user can’t directly access those directories.
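For example, something along these lines should work without opening a full root shell (assuming your admin user is in the sudo group, which it normally is on Ubuntu server):

sudo ls -l /etc/elasticsearch
sudo less /var/log/elasticsearch/graylog.log   # log file name usually follows the cluster name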

I hope that helps (and that I haven’t led you astray).

