I am using Graylog 4.2.5 and am currently trying to get different systems to send their logs to it.
In the past, I repeatedly had the problem that messages sent to syslog inputs went missing because of format problems.
If I created “Raw” inputs instead, everything was visible, but I had to define extractors, etc.
Now I am trying to have Arista switches send their logs to Graylog.
→ Created RAW inputs for UDP and TCP.
→ Sending syslog to them (tried both).
But: There are no messages.
Graylog server.log does not show anything.
I tried another syslog server, which shows the message:
Jan 14 14:06:36 switch-i2-01311-1 Cli: SYS-5-CONFIG_I: Configured from console by admin on vty6 (10.49.0.180)
How can I debug the problem?
tcpdump shows that the messages are arriving (on a high port).
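For reference, this is roughly the capture I ran to confirm that; 5514 is only an example, substitute whatever port the Raw input actually listens on:

# show syslog traffic reaching this host on the Raw input port (5514 is an example)
sudo tcpdump -i any -nn -A 'udp port 5514 or tcp port 5514'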
Additional info: For the syslog input, the messages appeared to be “in the future”, although I configured the input so that it is not allowed to override the timestamp. Is there an easy fix?
Second issue:
One of three Graylog cluster nodes is receiving logs, but not forwarding them to ES.
The “Nodes” overview shows:
graylog01: In 200 / Out 0 msg/s
The journal contains 2,179,453 unprocessed messages in 20 segments
graylog02: In 300 / Out 200 msg/s
The journal contains 21 unprocessed messages in 1 segment. 7 messages appended, 7 messages read in the last second.
graylog03: In 100 / Out 160 msg/s
The journal contains 28 unprocessed messages in 1 segment. 28 messages appended, 28 messages read in the last second.
Heap space is not exhausted and server.log does not show anything. I deleted the journal and it seems to be working now.
I’m sure most of us here use multiple inputs, and as you noticed, the Raw/Plaintext UDP/TCP inputs
accept differently formatted messages. I personally don’t use one type of input for every device. For example:
Switch devices use Raw/Plaintext UDP
Firewall devices use Syslog TCP/UDP if they can send a proper syslog format; if not, I put them on a Raw/Plaintext UDP input
Windows devices use either Beats or GELF TCP/TLS
Linux devices use either Beats or GELF TCP/TLS
Why do I do this?
Because not only is it easier to have an input accept messages/logs from those devices, but I also get the fields I need, or I’m able to create fields just for those devices.
Not only do I separate the devices by input, BUT I also use different ports.
Now, as for your original question: missing messages.
You have some resource issues for sure. Either you have plenty of physical CPU cores but your Graylog server configuration file is misconfigured (meaning the process buffer settings), or you need more physical CPU cores added to Graylog. It could also be that your Elasticsearch is no longer working.
Unfortunately I can’t tell where your issue lies. You would need to share more information about your environment for us to help you.
Sorry for the late reply. It happened one more time: all three of my nodes stopped processing any logs from the journal. The journal just kept getting bigger while nothing was sent to ES.
Here are the details:
I set up a Graylog 4.2.5 cluster with:
3x Graylog + MongoDB nodes (6 GB heap each), 8 cores
3x ES nodes (30 GB heap each), 8 cores
Log volume is currently about 20 GB/day and the system is not under much load.
The filesystem is not full and the ES cluster is “green”.
After a few days of runtime, all the Graylog nodes stop processing logs.
→ In 103 / Out 0 msg/s. The journal contains 1,369,976 unprocessed messages in 11 segments. 88 messages appended, 0 messages read in the last second.
The only way out is to delete the journal and restart Graylog.
I found these messages, but I am not sure whether they are the reason, or why this happens:
2022-01-18T10:00:24.263Z WARN [MappedMessage] Could not transform CEF field [type] according to standard. Skipping.
2022-01-18T10:02:28.820Z ERROR [PivotAggregationSearch] Aggregation search query returned an error: Elasticsearch exception [type=index_not_found_exception, reason=no such index ].
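To rule out the index side, I also checked the Elasticsearch cluster directly; es01 below is just a placeholder for one of my ES nodes:

# cluster health should report "green" and all three data nodes
curl -s 'http://es01:9200/_cluster/health?pretty'
# list the indices so I can see whether the expected write index actually exists
curl -s 'http://es01:9200/_cat/indices?v'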
What I do not understand: if there were too few output/process buffer processors, IMHO the journal should fill up slowly, because more messages would be coming in than going out.
And CPU usage is only 5-6%.
But the behaviour is different: the system stops processing. The journal fills up while NO messages reach Elasticsearch.
So:
Do you think my config is OK for 8 physical cores?
Can you give me a hint why Graylog could completely stop processing instead of just getting “slow”?
Is there anything else in my configuration that is important for this case, apart from the cluster layout and the buffer config?
The journal is constantly filling and none of the messages are reaching the output.
I tried increasing outputbuffer_processors and processbuffer_processors and raising the number of CPUs to 16, but the problem is still the same:
The journal is filling up, but nothing is sent to ES.
I also tried increasing output_batch_size from 500 to 5000, but the problem is the same…
I also stopped the incoming messages, but even with absolutely zero input, the journal does not get any smaller.
server.log only shows:
2022-01-19T13:58:59.454Z WARN [LocalKafkaJournal] Journal utilization (101.0%) has gone over 95%.
2022-01-19T13:58:59.455Z INFO [LocalKafkaJournal] Journal usage is 101.00% (threshold 100%), changing load balancer status from ALIVE to THROTTLED
2022-01-19T13:59:59.453Z WARN [LocalKafkaJournal] Journal utilization (103.0%) has gone over 95%.
Currently, my only way out seems to be deleting the journal again.
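“Deleting the journal” here means stopping the node and wiping the journal directory; the path below is only the default message_journal_dir, and every unprocessed message in it is lost:

# WARNING: this discards all unprocessed messages still sitting in the journal
sudo systemctl stop graylog-server
sudo rm -rf /var/lib/graylog-server/journal/*
sudo systemctl start graylog-server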
Maybe post (obfuscated and formatted with the </> tool) the configuration file (/etc/graylog/server/server.conf) for graylog01… the one that is not working… and perhaps the configuration of one that is working? If there is no misconfiguration there, it is possible that graylog01 is capturing a message and getting into a GROK loop or similar.
Random thought: check to make sure that graylog01 knows about your entire Elasticsearch cluster via ITS own DNS…
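In other words, the elasticsearch_hosts line in its server.conf should list every ES node under a name that graylog01 itself can resolve; the hostnames below are just placeholders:

# /etc/graylog/server/server.conf on graylog01 (hostnames are placeholders)
# every name here must resolve correctly on THIS node
elasticsearch_hosts = http://es01.example.local:9200,http://es02.example.local:9200,http://es03.example.local:9200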
Older content packs often have compatibility issues. I would examine each extractor, test to see how they do, and enable them individually… kind of a pain for that many!! Perhaps enable them in a binary-search manner.
If you look at the node (System->Nodes) and then click on More actions… you can check the process buffer dump… in the past I have seen things caught in a loop there, which helped point to where I needed to fix… still hard to track down though.
I notice you only have 2 shards and 1 replica… although these are only the initial settings in the conf file. It is likely you changed them in the GUI to allow data to spread more evenly (just making sure).
Hello,
A couple of things I noticed in your Graylog configuration file. I’m comparing against my lab Graylog server, which does around 500 msg/s and about 30-35 GB of logs a day, and I’m running a single Graylog server.
To avoid other problems, you should probably raise this (the journal size) to 2 or 3 GB. 100 MB can fill up quickly, which might be the reason why your journal has 20 segments.
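Something along these lines in server.conf; the exact value is up to you:

# /etc/graylog/server/server.conf
# size the on-disk journal so a processing stall does not overflow it right away
message_journal_enabled = true
message_journal_max_size = 3gb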
Next,
I’m assuming the line below is not your exact configuration. I would only use that setting (0.0.0.0) if I had a single server for testing, or maybe if the Graylog server does not have a static IP address.
Next,
These three settings combined should not be greater than the number of physical CPU cores you have. Your configuration file shows that you are trying to create 22 CPU threads from 8 CPU cores. Problems may occur.
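As a rough sketch for an 8-core box, keep the sum of the three settings at or below the core count, for example:

# /etc/graylog/server/server.conf (example for 8 physical cores: 2 + 4 + 2 = 8 threads)
inputbuffer_processors = 2
processbuffer_processors = 4
outputbuffer_processors = 2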
Holy cow, that’s a lot of extractors, which may require more resources like CPU. After looking over your configuration and what you have installed on your Graylog cluster, it is starting to look to me like whenever you have a problem/issue, you raise the settings too much or max them out to resolve it. Sometimes it’s a fine balance between what you want and what you have to work with.
What I would do, and I think you are already doing it, is disable your content pack from here (which should disable all those extractors).
Next, try setting everything in your Graylog server configuration file back to the original settings it had before you started troubleshooting this issue, then maybe adjust the values to appropriate settings based on the amount of resources you have on each Graylog server.
Ingest some logs and watch what happens.
Hope that helps, and keep us updated; I’d like to know how you resolved this issue.
Thank you for your great post.
I think the initial issue is related to the extractors from the content pack. Since I removed it, the issue has not come back.
About your other points:
Processors: I changed the config yesterday (increased the hardware core count and the config). Sorry for the confusion. Now I have gone back to 4 process buffer processors, 2 input buffer processors, 2 output buffer processors, and 12 cores.
ES nodes
The nodes have 64 GB memory and a 32 GB heap.
Journal size
Increased to 3 GB.
http_bind_address = 0.0.0.0:9000
Can you explain to me why this is a problem? Why does it matter? I have to bind the interface IP to be able to communicate within the cluster. Wouldn’t changing it just exclude “loopback”?
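If it matters, the relevant pair of settings on graylog01 currently looks roughly like this; the address is a placeholder, and as far as I understand it, http_publish_uri is what the other nodes actually use to reach this one:

# /etc/graylog/server/server.conf on graylog01 (address is a placeholder)
http_bind_address = 0.0.0.0:9000
# the URI the other cluster nodes and the web interface use to reach this node
http_publish_uri = http://10.0.0.11:9000/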
What do your buffers look like when your journal fills up? What they are doing (or not doing) would help identify what/where you need to troubleshoot.
Here’s a post I wrote a while back on Journal filling up that might help with understanding the message ingestion cycle.
A couple of additional thoughts…
Min/max for the Java heap size should be the same (see the example after this list).
Heap size should not be more than 50% of system memory or 32 GB, whichever is smaller.
At 20 GB/day, you might consider raising your journal to something more than 3 GB, perhaps 10 GB? That will give you at least half a day of buffer in case something happens to message processing.
Always leave at least 1 core for your system. You have 12, so allocate no more than 11; you have 8 allocated, which is plenty for the load you have.
Consider the infrastructure: what kind of storage do you have? Single drive? RAID? SAN? SSD? HDD? This can make a big difference once your ingest rate goes up or you start having more people in the system running queries, etc.
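For the Graylog heap that means an identical -Xms/-Xmx pair, e.g. in /etc/default/graylog-server (or /etc/sysconfig/graylog-server on RPM-based systems); the 6g below just mirrors your current value:

# /etc/default/graylog-server
# min and max heap identical; keep whatever other flags are already on this line
GRAYLOG_SERVER_JAVA_OPTS="-Xms6g -Xmx6g"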
The input buffer was empty, and the output buffer too. The process buffer was 100% full, but not processing any messages.
Check
Check, already set up.
Check, 10 GB
Check → 12 cores
Currently it’s only one SSD per node; the ES cluster is currently showing VERY little I/O and nearly no I/O wait. But I can add additional disks if I/O becomes a bottleneck.
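For what it’s worth, this is how I checked the disk load on the ES nodes (iostat comes from the sysstat package):

# extended device statistics, refreshed every 2 seconds, 5 samples;
# %util and await stay low and %iowait is close to zero
iostat -x 2 5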