Hello guys, I need your help. I'm struggling with an incident in my prod environment.
I have an HPE ProLiant ML350 Gen9 with 64 GB RAM and 16 CPUs. I have 2 VMs installed in Proxmox: Graylog 4 on SSD and a Windows Server 2016 DC.
Graylog VM specs:
I have 78 configured inputs in Graylog: Windows (GELF TCP), Linux (syslog), MikroTiks, etc.
Graylog is fully configured with extractors, pipelines and everything I need for my prod environment. Suddenly, last month, something weird started happening at random times, maybe two or three times a week: it freezes the entire Proxmox server and I need to do a cold boot to get it back.
For Windows machines I use NXLog CE to forward the logs.
I don't remember changing anything that might cause this. It's a big problem, because I have a domain controller on the host too.
## GC configuration
## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
## JVM temporary directory
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
# specify an alternative path for JVM fatal error logs
## JDK 8 GC logging
# JDK 9+ GC logging
# Path to the java executable.
# Default Java options for heap and garbage collection.
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"
# Avoid endless loop with some TLSv1.3 implementations.
# Pass some extra args to graylog-server. (i.e. "-d" to enable debug mode)
# Program that will be used to wrap the graylog-server command. Useful to
# support programs like authbind.
I have a MUCH smaller environment and company, and for me it was a GROK pattern that wasn’t matching the string properly and got lost. It could have been backtracking, like mentioned in the other post, though for me the fix was a simple change to a field match of IP and/or IPORHOST. GROK is helped greatly if you can pin the search to the beginning of the message with ^, and even better to the end with $ as well. This keeps it from trying the pattern across the whole message before failing out. I found it by a combination of watching which message the logs stopped on and looking at the process buffers to see if they listed anything (usually they say idle or something like that).
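To illustrate the anchoring point, here is a minimal sketch using Python's `re` module. Graylog's GROK runs on Java regex, so this only shows the general behaviour, not Graylog itself. The log lines, the `IPV4` sub-pattern and the `extract` helper are made up for the example.

```python
import re

# Hypothetical IPv4 sub-pattern, standing in for a GROK pattern such as IP.
IPV4 = r"\d{1,3}(?:\.\d{1,3}){3}"

# Same pattern twice: once pinned to both ends of the message, once free-floating.
anchored = re.compile(rf"^(?P<src>{IPV4}) connected to (?P<dst>{IPV4})$")
unanchored = re.compile(rf"(?P<src>{IPV4}) connected to (?P<dst>{IPV4})")

# Hypothetical messages, for illustration only.
good = "10.0.0.5 connected to 192.168.1.10"
noisy = "prefix 10.0.0.5 connected to 192.168.1.10 trailing text"


def extract(pattern, message):
    """Return the named groups if the pattern matches, otherwise None."""
    match = pattern.search(message)
    return match.groupdict() if match else None


print(extract(anchored, good))     # matches: message has exactly the expected shape
print(extract(anchored, noisy))    # None: rejected at the first character
print(extract(unanchored, noisy))  # matches, but only after scanning into the message
```

With the anchors in place, a message that does not have the expected shape is rejected at the start instead of the engine retrying the pattern from every position in the message.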
Just chiming in. I noticed you stated this happens when employees come to work.
Windows has been known to create message storms. If your GL freezes up during that time, have you checked your buffers when this issue occurs? If they are at 90-100%, it's possible that your CPU/memory will spike.
Nope, no spikes of memory or CPU. Also, the buffers are not filling up at that moment. It has happened at night too, so not only when working hours start. How is it possible for it to freeze the entire host, and not only the GL VM? Can I maybe change something in the configs to test it out? Any ideas?
OK, I think I might have a clue.
I audit file systems on my Windows machines, and now I remember having a big storm of “An attempt was made to access an object.” AUDIT SUCCESS messages when the antivirus scan kicked in. The AV scan starts on most of the devices at the same time, so GL is flooded with Event ID 4663.
Not knowing how to drop this message at the source PC in NXLog, I wrote a pipeline in GL to drop messages with Event ID 4663 when the process name was my antivirus accessing the files.
So far the rule works and those messages no longer show up, but now that I'm looking at the process buffer dump, I see those messages still coming in and being processed before being dropped by GL.
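For reference, the drop condition described above can be sketched as a small predicate. This is Python rather than a Graylog pipeline rule or NXLog config, and the field names `EventID` and `ProcessName`, the `av_scanner.exe` name and the `should_drop` helper are all assumptions for illustration, not taken from my actual setup.

```python
def should_drop(event, av_process_names):
    """Return True for Event ID 4663 records generated by an antivirus process.

    `event` is a dict of parsed log fields; the field names used here are
    assumed, so adjust them to whatever the forwarder actually emits.
    """
    if str(event.get("EventID")) != "4663":
        return False
    process = str(event.get("ProcessName", "")).lower()
    # Match on the executable name at the end of the full process path.
    return any(process.endswith(name.lower()) for name in av_process_names)


# Hypothetical sample events.
av_names = ["av_scanner.exe"]
scan_event = {"EventID": 4663, "ProcessName": r"C:\Program Files\AV\av_scanner.exe"}
user_event = {"EventID": 4663, "ProcessName": r"C:\Windows\explorer.exe"}

print(should_drop(scan_event, av_names))  # True
print(should_drop(user_event, av_names))  # False
```

The same check has a very different cost depending on where it runs: inside GL every message still has to be received and processed before it is dropped, while at the source it never leaves the machine.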
To be honest, if this is messing things up for you, I would look into the source of this instead of trying to patch it with pipelines. Just an idea.
I have never had a VM freeze up my host server. What I have had was too many virtual machines on one host, but my Hyper-V servers are configured to put VMs in a paused state so the host does not crash (freeze up).
If you're using NXLog CE, this can be configured like so…