How to debug missing messages

Hi!

I am using Graylog 4.2.5 and I am currently trying to let different systems send logs to it.
In the past, I had several times the problem, that messages, that were sent to syslog-inputs were missing because of format problems.

If I created “RAW”-inputs, everything has been visible, but I had to define extractors, etc.

Now, I am trying to let Arista switches send their logs to Graylog.
→ Created RAW inputs for UDP and TCP.
→ Sending syslog to them (tried both).

But: There are no messages.
Graylog server.log does not show anything.
Tried another syslog-server, which is showing:

Jan 14 14:06:36 switch-i2-01311-1 Cli: SYS-5-CONFIG_I: Configured from console by admin on vty6 (10.49.0.180)

How can I debug the problem?
tcpdump shows up, that messages are arriving (high port).

Thank you for your help
KPS

Additional Info:
For Syslog-Input, the messages seemed to be “in the future”, although I did configure, that the input “is not allowed to overwrite timestamp”. Is there any easy fix?

Second issue:
One of three Graylog cluster nodes is receiving logs, but not forwarding them to ES.
The “nodes”-overview is showring:

graylog01: In 200 / Out 0 msg/s
The journal contains 2,179,453 unprocessed messages in 20 segments
graylog02: In 300 / Out 200 msg/s
The journal contains 21 unprocessed messages in 1 segment. 7 messages appended, 7 messages read in the last second.
graylog03: In 100 / Out 160 msg/s
The journal contains 28 unprocessed messages in 1 segment. 28 messages appended, 28 messages read in the last second.

Heap space is not exhausted and server.log is not showing anything.
I did delete the journal and it seems to be working, now.

Thank you for your thoughts.
KPS

Hello @KPS

I’m sure most of us here use multiple INPUTs and as you noticed the Raw/Plaintext UDP /TCP
accepts different formatted messages. I personally don’t use one type of INPUT for every device. For example:

  • Switch devices use Raw/Plaintext UDP
  • Firewall Devices (exception if they can send a syslog format) Syslog TCP/UDP, If not, I put them on a Raw/Plaintext UDP INPUT
  • Windows Devices uses either Beats, or GELF TCL/TLS
  • Linux Devices Either Beats or GELF TCP/TLS

Why do I do this?
Because not only is it easier to allow an input to accept messages/logs from those device but I get the fields I need or I’m able create fields just for those devices.

Not only do I separate the devices by input BUT I also use different ports.

NOW , as for your original question. Missing Messages.

You have some resource issues for sure, Either you have a ton of physical CPU cores and your Graylog Server configuration file is misconfigured, meaning “process buffer/s” or you do need more physical CPU core added to Graylog. It also could be that your elasticsearch is no longer working.

Unfortunately I cant tell where your issue is directed at. You would need to share more information on your environment for us to help you.

Hi!

Sorry for the late reply. It happend one more time. All my three nodes did stop processing any logs in the journal. The journal did just get bigger while not sending anything to ES:

Here the details:

I did set up a graylog 4.2.5 cluster with:
3x GL+MongoDB-nodes (6GB heap, each), 8 cores
3x ES-nodes (30 GB heap, each), 8 cores

Log volume is currently about 20 GB/d and the system is not under much load.
Filesystem is not full, ES-cluster is “green”.

After some days runtime, all the GL nodes stop processing logs.

→ In 103 / Out 0 msg/s. The journal contains 1,369,976 unprocessed messages in 11 segments. 88 messages appended, 0 messages read in the last second.

The only way out is, to delete the journal and restart graylog.

I found that message, but I am not sure, if it is the reason and why it happens:

2022-01-18T10:00:24.263Z WARN [MappedMessage] Could not transform CEF field [type] according to standard. Skipping.
2022-01-18T10:02:28.820Z ERROR [PivotAggregationSearch] Aggregation search query returned an error: Elasticsearch exception [type=index_not_found_exception, reason=no such index ].

Do you have any idea about that?

Thank you and best wishes

Hello,

Yes I do, as I stated above in my last post.

Unfortunately I don’t know how you set up your environment or any configuration to help you further.

Hi!

Sorry, I forgot that information:
Currently, I am working with:

processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2
processor_wait_strategy = blocking

What I do not understand: If there would be too few output/processing buffers, IMHO the journal should slowly fill because more messages are incoming than going out.
And CPU-percentage is only 5-6%

But the behaviour is different: The system stops processing. Journal is filling up, while NO messages are reaching elasticsearch.

So:

  • Do you think, my config is o.k. for 8 physical cores?
  • Can you give me a hint, why graylog could completely stop processing instead of getting “slow”?

Is there anything else of my configuration, that is important for that case except cluster layout and buffer config?

Thank you for your help and best wishes

Hi!

It just happened again…

Journal is constantly filling and none of the messages are reaching output.

I tried to increase outputbuffer_processors and processbuffer_processors and the number of CPUs to 16, but the problem is still the same:
Journal is filling up, but nothing is sent to ES.

I also tried to increase output_batch_size from 500 to 5000 but same problem…
I also stopped the incoming messages, but although there is absolutely zero input, the journal size does not get smaller.

server.log does only show:

2022-01-19T13:58:59.454Z WARN  [LocalKafkaJournal] Journal utilization (101.0%) has gone over 95%.
2022-01-19T13:58:59.455Z INFO  [LocalKafkaJournal] Journal usage is 101.00% (threshold 100%), changing load balancer status from ALIVE to THROTTLED
2022-01-19T13:59:59.453Z WARN  [LocalKafkaJournal] Journal utilization (103.0%) has gone over 95%.

Currently, my only way out seems to be deleting the journal again.

Do you have any idea, what I can try next?

Maybe post (obfuscated and formatted with </> tool) the configuration (/etc/graylog/server/server.conf) file for graylog01… the one that is not working… and perhaps the configuration of one that is working? If there is not a misconfiguration there, it is possible that that graylog01 is capturing a message and getting into a GROK loop or similar.

random thought - check to make sure that Graylog01 knows about your entire elastic cluster on ITS DNS…

Hello

Adding on to what @tmacgbay stated, Check your elasticsearch/s logs/status also.

How much logs are you ingesting?

Hi!

Thank you for your answer.

I am currently only sending about 200-400 logs/s to the system which is about 20GB/d. Elasticsearch does not show any errors.

But: Currently, I think, I have an approach:
The problems did start in a time frame, where I also added the extractors for VMWare:

I think, one of these extracors seems to bring graylog message processing in some kind of loop. When that happens, processing stops.

→ I disabled these (194) extractors of the content pack and I am now waiting, if the problem does re-occure.

→ Is there any possibility to see which extractor does have that problem?

Here my config:

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = xxxx
root_password_sha2 = xxxx
root_timezone = Europe/Berlin
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 0.0.0.0:9000
elasticsearch_hosts = http://es-node-01:9200,http://es-node-03:9200,http://es-node-03:9200
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 60
retention_strategy = delete
elasticsearch_shards = 2
elasticsearch_replicas = 1
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = true
allow_highlighting = true
elasticsearch_analyzer = standard
output_batch_size = 5000
output_flush_interval = 1
processbuffer_processors = 14
outputbuffer_processors = 6
processor_wait_strategy = sleeping
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 1h
message_journal_max_size = 100mb
message_journal_segment_size = 10mb
lb_recognition_period_seconds = 3
stream_processing_timeout = 200
stream_processing_max_faults = 3
mongodb_uri = mongodb://graylog:xxxx@10.0.49.61:27017,10.0.49.62:27017,10.0.49.63:27017/graylog?replicaSet=rs0
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
transport_email_enabled = true
transport_email_hostname = xxx
transport_email_port = 25
transport_email_use_auth = false
transport_email_from_email = graylog@xxx.de
transport_email_use_tls = false
transport_email_use_ssl = false
proxied_requests_thread_pool_size = 32

Thank you for your help!

Older content packs often have compatibility issues. I would examine each extractor and test to see how they do and enable them individually… kind of a pain for that many!! Perhaps enable in binary search manner.

If you look at the node (System->Nodes) then click on more actions… you can check process buffer dump… in the past I have seen things caught in a loop there that helped point to where I needed to fix… still hard to tack down though.

I notice you only have 2 shards and 1 replica… although this is only initial settings in the conf file. It is likely you changed in the GUI to allow data to spread more evenly (just making sure)

Hello,
A couple things I did notice in your Graylog Configuration file. I’m comparing my lab Graylog server which does around 500 mps and about 30-35 GB logs a day. And I’m running one single graylog server.

To avoid other problems probably raise this to 2 or 3 GB. 100MB can fill up quick which is might be the reason why you have in 20 segments.

Next,
I’m assuming this below is not the exact configuration. I would only use that setting (0.0.0.0) if I have one server for testing, maybe or the Graylog server does not have a static IP Address.

Next,
These three setting should not be greater then the Physical CPU cores you have. In your configuration file shows that your trying to create 22 cpu threads from 8 CPU cores. Problems may occur.

processbuffer_processors = 14
outputbuffer_processors = 6
inputbuffer_processors = 2

Next

Do you have over 30 GB of memory on each Elasticsearch server?, If not, I would lower your heap to something more manageable like 5 GB per ES node.

I agree with @tmacgbay

Next,

Holy cow, that’s a lot of extractors which may require more resources like CPU. After looking over your configurations and what you installed on your Graylog cluster this is starting to look like to me is when your having a problem/issue you raise the settings to much or max them out to resolve your issue. Sometime its a fin balance between what you want and what you have to work with.

  • What I would do and I think your doing it is disable your Content Pack from here ( which should disable all those extractors)
  • Next try setting all your graylog server file back to the original setting before you started trouble shooting this issue. The maybe adjusting them to appropriate setting from the amount of resources you have on each Graylog Server.
  • Ingest some logs and watch

Hope that helps, and keep us updated like to know how you resolved this issue :slight_smile:

1 Like

Hi!

Thank you for your great post.
I think, the initial issue is related to the extractors of the content pack. Since I removed it, the issue did not come back.

About your other points:

  • processors: I did change the config yesterday (increaed hardware-core-count and config). Sorry for the confusion. Now I went back to: 4(processor), 2 (input) ,2 (output)-processors and 12 cores

  • ES-nodes
    The nodes have 64GB memory and 32GB heap

  • journal size
    Is increased to 3GB

  • http_bind_address = 0.0.0.0:9000
    Can you explain me, why this is a problem? Why does it matter? I have to bind the interface IP to be able to communicate within the cluster. Wouldn’t a change just exclude “loopback”?

One more time: Thank you for your help!!

What do your buffers look like when your journal fills up? Based on what they are doing (or not doing) would help identify what/where you need to troubleshoot.

Here’s a post I wrote a while back on Journal filling up that might help with understanding the message ingestion cycle.

Couple additional thoughts…

  • min/max for the java heap size should be the same.
  • heap size should not be more than 50% of system memory or 32GB, which ever is smaller.
  • at 20GB/day, you might consider raising your journal to something more than 3GB. perhaps 10GB? That will give you at least half a day in case something happens to message processing.
  • Always leave at lest 1 core for your system. you have 12, allocate no more than 11. you have 8 which is plenty for the load you have.
  • consider the infrastructure, what kind of storage do you have? Single Drive? RAID? SAN? SSD? HDD? this can make a big difference once your ingest starts going up or you start having more people in the system running queries, etc.
1 Like

Input-buffer was empty, output, too. Processing buffer was 100% full, but not processing any messages.

Check

Check, already set up.

Check, 10 GB

Check → 12C

Currently its only one SSD per Node, ES-cluster is currently showing VERY few I/O and nearly no I/O-wait. But I can add additional disks, if I/O will become a bottleneck.

Thank you for your help!