Fairly generic question. We have a multi-node GL setup (v3) that we are currently testing, and we have a strange issue where one of the nodes sometimes stops processing: the process buffer maxes out and the node just sits there. Restarting GL on that node fixes the issue, but it can then reappear at random later.
As we are testing and trying different things, I can't say with 100% certainty that it is only this node (we have log LBs that spread the load, though they tend to prefer specific nodes), but it seems to be.
Question is, what is the best method to narrow down what the cause could be? We monitor the high-level metrics (using Telegraf/Grafana etc.): CPU, memory, JVM, input/process/output buffers, journal size and so on, and all seems to be in order until we get this out-of-character issue. Interestingly, the CPU on the node does not max out when it occurs, whereas normally CPU is the bottleneck at high processing load.
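To make the symptom concrete, here is a rough sketch of how the pattern could be flagged from the samples we already collect: the process buffer is full, nothing is being written out, and yet the CPU is not busy. This is only an illustration. The `Sample` fields, the thresholds and the `find_stalls` helper are all hypothetical names and values, not anything provided by GL or Telegraf.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Sample:
    """One metrics sample for a node (field names are hypothetical)."""
    process_buffer_pct: float  # process buffer utilisation, 0-100
    output_rate: float         # messages written out per second
    cpu_pct: float             # node CPU utilisation, 0-100


def find_stalls(
    samples: Iterable[Sample],
    min_run: int = 3,               # assumed: consecutive samples needed
    buffer_full_pct: float = 95.0,  # assumed: "buffer maxed out" threshold
    idle_rate: float = 1.0,         # assumed: "not processing" threshold
    cpu_busy_pct: float = 80.0,     # assumed: "CPU-bound" threshold
) -> List[int]:
    """Return the start index of each run of at least min_run consecutive
    samples where the buffer is full, output is idle and CPU is not busy."""
    stalls: List[int] = []
    run_start = None
    for i, s in enumerate(samples):
        stalled = (
            s.process_buffer_pct >= buffer_full_pct
            and s.output_rate < idle_rate
            and s.cpu_pct < cpu_busy_pct
        )
        if not stalled:
            run_start = None
            continue
        if run_start is None:
            run_start = i
        # Report each run once, at the moment it reaches min_run samples.
        if i - run_start + 1 == min_run:
            stalls.append(run_start)
    return stalls
```

A full buffer with high CPU and a healthy output rate is just normal heavy load, so the check deliberately ignores it and only reports the "full but idle" case.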
Because it's a random question, I'm not looking for specific advice, but any pointers to specific sub-metrics we could monitor to help identify which processes are the cause would be helpful.
Note that we did see this sort of thing quite a lot when we were using the DNS resolver lookup feature, but we turned it off (we process circa 4k logs/sec at the 95th percentile).