Graylog 4.0.15 freezes Proxmox host randomly

Hello guys, I need your help. I'm struggling with an incident in my prod environment.
I have an HPE ProLiant ML350 Gen9 with 64 GB RAM and 16 CPUs. Two VMs run on Proxmox: Graylog 4 (on SSD) and a Windows Server 2016 DC.

Graylog VM specs:

[Screenshot of the Graylog VM specs; image not preserved]

I have 78 inputs configured in Graylog: Windows (GELF TCP), Linux (syslog), MikroTik devices, etc.
Graylog is fully configured with extractors, pipelines and everything I need for my prod environment. Last month something weird suddenly started: at random times, two or three times a week, the entire Proxmox server freezes and I have to cold boot it.
For the Windows machines I use NXLog CE to forward the logs.
I don't remember changing anything that might cause this. It's a big problem, because I have a domain controller on the same host too.

Graylog conf:


is_master = true

node_id_file = /etc/graylog/server/node-id

password_secret = *****************

# The default root user is named 'admin'
#root_username = ****

root_password_sha2 = *******************

root_email = ******************

root_timezone = Europe/Amsterdam

bin_dir = /usr/share/graylog-server/bin

data_dir = /var/lib/graylog-server

plugin_dir = /usr/share/graylog-server/plugin

http_bind_address = 0.0.0.0:9000

elasticsearch_max_total_connections = 220


rotation_strategy = count

elasticsearch_max_docs_per_index = 20000000


elasticsearch_max_number_of_indices = 20

retention_strategy = delete

elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog

allow_leading_wildcard_searches = false

allow_highlighting = false

elasticsearch_analyzer = standard

output_batch_size = 500

output_flush_interval = 1

output_fault_count_threshold = 5
output_fault_penalty_seconds = 30

processbuffer_processors = 5
outputbuffer_processors = 3

processor_wait_strategy = blocking

ring_size = 131072

inputbuffer_ring_size = 131072
inputbuffer_processors = 3
inputbuffer_wait_strategy = blocking

message_journal_enabled = true

message_journal_dir = /var/lib/graylog-server/journal

lb_recognition_period_seconds = 3

mongodb_uri = mongodb://localhost/graylog

mongodb_max_connections = 1000

mongodb_threads_allowed_to_block_multiplier = 5

transport_email_enabled = true
transport_email_hostname = *****
transport_email_port = ***
transport_email_use_auth = true
transport_email_auth_username = **************
transport_email_auth_password = ***********
transport_email_subject_prefix = [graylog]
transport_email_from_email = ***************

http_connect_timeout = 10s

proxied_requests_thread_pool_size = 32

versionchecks = false

Elasticsearch heap conf:

-Xms10g
-Xmx10g
## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30

## JVM temporary directory
-Djava.io.tmpdir=${ES_TMPDIR}

## heap dumps

# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch

# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log

## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m

# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m

Graylog Heap:

# Path to the java executable.
JAVA=/usr/bin/java

# Default Java options for heap and garbage collection.
GRAYLOG_SERVER_JAVA_OPTS="-Xms4g -Xmx4g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:-OmitStackTraceInFastThrow"

# Avoid endless loop with some TLSv1.3 implementations.
GRAYLOG_SERVER_JAVA_OPTS="$GRAYLOG_SERVER_JAVA_OPTS -Djdk.tls.acknowledgeCloseNotify=true"

# Pass some extra args to graylog-server. (i.e. "-d" to enable debug mode)
GRAYLOG_SERVER_ARGS=""

# Program that will be used to wrap the graylog-server command. Useful to
# support programs like authbind.
GRAYLOG_COMMAND_WRAPPER=""

I hope you can help me solve this issue.

It could be an errant GROK on a message that has changed slightly - here is a relevant post to that:

As I have noticed, this sometimes happens in the morning when all employees come to work and turn on their computers. The indicator in the top left corner shows 3000+ in and 3000+ out.

At that point the entire Proxmox host freezes (not always).

I don't see anything in the Graylog server logs after the reboot; the log just stops at the time of the freeze and resumes after the reboot.
Could an errant GROK freeze the entire host?

I have a MUCH smaller environment and company, and for me it was a GROK pattern that wasn’t matching the string properly and got lost. It could have been backtracking, as mentioned in the other post, though for me the fix was a simple change to a field match of IP and/or IPORHOST. GROK is helped greatly if you can pin the pattern to the beginning of the message with ^ and, even better, to the end with $. This keeps it from trying the pattern across the whole message before failing out. I tracked it down by a combination of watching which message the logs stopped on and looking at the process buffers to see if they listed anything (usually they say idle or something like that).

I also looked at the Proxmox graphs for any peaks at the time of the freeze, but found nothing abnormal. The graphs are flat, with no CPU or memory peaks to indicate anything. I will check all the GROK patterns again for any change.

When my system was locked up, I don’t recall it topping out memory or CPU. I don’t know for sure at all if it is your issue, just one possibility of many.

Take a look at the Graylog and JVM config for any misconfiguration I have missed. I will post the NXLog config tomorrow when I get to work.

Hello,

Just chiming in. I noticed you said this happens when employees come to work.

Windows has been known to create message storms. If your GL freezes up during that time, have you checked your buffers when the issue occurs? If they are at 90-100%, it's possible that your CPU/memory will spike.

Nope, no spikes of memory or CPU. Also, the buffers are not filling up at that moment. It has happened at night too, so not only when working hours start. How is it possible for this to freeze the entire host, and not only the GL VM? Can I change something in the configs to test it out? Any ideas?

OK, I think I might have a clue.
I audit file systems on my Windows machines, and now I remember a big storm of “An attempt was made to access an object.” AUDIT SUCCESS messages when the antivirus scan kicked in. The AV scan starts on most of the devices at the same time, so GL is flooded with Event ID 4663.
Not knowing how to drop this message at the source PC in NXLog, I wrote a pipeline rule in GL to drop messages with Event ID 4663 when the process name is my antivirus accessing the files.
So far the rule works and I no longer see those messages, but now that I'm looking at the process-buffer dump, I see those messages still coming in and being processed before GL drops them.

Hello,

AUDIT SUCCESS Event ID 4663 (An attempt was made to access an object)
I have this in my lab, for security measures.

To be honest, if this is messing things up for you, I would look into the source of it instead of trying to patch it with pipelines. Just an idea.

I have never had a VM freeze up my host server. What I have had was too many virtual machines on one host, but my Hyper-V servers are configured to put VMs in a paused state so the host does not crash (freeze up).

EDIT:
If you're using NXLog CE, this can be configured like so:

<Input zone-01>
    Module      im_msvistalog
    Query <QueryList>\
    <Query Id="0">\
    <Select Path="Application">*</Select>\
    <Select Path="System">*</Select>\
    <Select Path="Security">*</Select>\
    </Query>\
    </QueryList>  
</Input>

Or something like this

<Input  zone-02>
    Module      im_msvistalog

    Exec if ($EventType == 'VERBOSE') OR ($EventType == 'INFO') OR ($EventType == 'AUDIT_SUCCESS') drop();
    Exec if ($SourceName == 'Microsoft-Windows-KnownFolders' AND $EventID == 4663) drop();
</Input>

Adjust your query to prevent those messages from reaching GL.
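For the specific case described earlier in the thread (Event ID 4663 generated while the antivirus scans files), a source-side filter might look roughly like the sketch below. It is only an illustration under a few assumptions: the instance name `security_no_av` and the executable name `ExampleAV.exe` are placeholders that do not come from the thread, and `$ProcessName` assumes that im_msvistalog exposes the event's ProcessName data as a field with that name, which is worth checking against a sample event before relying on it.

```
<Input security_no_av>
    Module      im_msvistalog
    # Read only the Security channel; adjust to match your existing query.
    Query <QueryList>\
    <Query Id="0">\
    <Select Path="Security">*</Select>\
    </Query>\
    </QueryList>
    # Drop 4663 events only when the accessing process is the AV scanner.
    # 'ExampleAV.exe' is a placeholder for the real executable name.
    Exec if ($EventID == 4663 and $ProcessName =~ /ExampleAV\.exe$/i) drop();
</Input>
```

Events dropped this way never leave the Windows host, so they should not reach Graylog's input or process buffers at all, unlike a pipeline rule that drops them only after they have been received.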


Hello, and thanks for the reply. This is the solution: reducing and filtering the messages at the source. Thanks again.
