Greetings, as stated above, I recently upgraded to Graylog 6.3.1. It has not been a smooth update, and I have been ironing out a lot of little issues for the past week or so.
However, the issue I can’t seem to resolve is with the Office365 Audit Log input. Ever since I upgraded, the input no longer writes to the index.
I have created new inputs and streams, added extractors, and removed extractors, and haven’t had any luck. There are no immediately apparent errors in either the Graylog server logs or the OpenSearch cluster logs.
What happens if you look at the input diagnosis screen for that input (it’s on the “more options” drop-down next to the input)? Do you see anything under message errors, etc.?
The input has been running for months and months, but the issue only occurred recently. I’ve been thinking a timestamp issue would explain it, but I’m just not sure where the issue would arise. It’s a Raw TCP input, and Graylog doesn’t modify the timestamp as far as I know. At minimum, it should have logs spread out across time even if the timestamps were wrong.
Okay, I’ve dug deeper and have a HUGE problem: Graylog is using neither its Input Buffer nor its Output Buffer, but the Processing Buffer is pegged at 100%. Further, I have approximately 4.7 million unprocessed messages. Graylog runs in an LXC, and I have maxed it out at 16 vCPUs and 32GB memory. Heap and garbage collection is temporarily set at
How many messages per second are you ingesting, and how much processing are you doing? A machine of that size should easily process 10-20k messages per second, BUT some really intensive pipeline rules can kill that number very fast.
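For a sense of scale, here is a rough sketch of how long the 4.7 million message backlog mentioned above would take to drain at the 10-20k messages per second quoted here. It assumes that rate is sustained and, by default, that nothing new arrives while the backlog drains; both are simplifications, and the function name is just for illustration.

```python
def drain_seconds(backlog_messages: int, processed_per_second: float,
                  incoming_per_second: float = 0.0) -> float:
    """Seconds needed to clear a backlog at a steady processing rate."""
    net_rate = processed_per_second - incoming_per_second
    if net_rate <= 0:
        # Processing cannot keep up with ingest, so the backlog never clears.
        return float("inf")
    return backlog_messages / net_rate


backlog = 4_700_000  # unprocessed messages reported earlier in the thread
for rate in (10_000, 20_000):
    minutes = drain_seconds(backlog, rate) / 60
    print(f"{rate} msg/s -> about {minutes:.1f} minutes to drain")
```

At those rates the backlog is a matter of minutes, so a backlog that keeps growing points at something slowing down processing rather than at raw hardware capacity.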
So I disabled all pipelines and rejiggered a PTR data adapter, and was able to catch up quickly (i.e. no backlog). So now I suppose the next step is to figure out which pipeline is causing so many issues for Graylog.
For anyone who comes here with a similar issue, I believe the cause in my case was that I had my PTR lookup data adapter pointed at some stale DNS servers. I had two of my largest streams, Fortigate Firewall and my M365 Audit log, utilizing that adapter. I believe the backlog of over 4 million messages was a result of each and every attempt to use the PTR data adapter timing out. It does not take long for an unwatched stream to back up.
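A back-of-the-envelope sketch of why timed-out lookups hurt so much. It assumes each lookup blocks a processing thread for the full timeout and that there is one processing thread per vCPU; the 2 second timeout, 1 ms healthy lookup time, and 500 msg/s ingest rate are made-up numbers for illustration, not values from my setup.

```python
def max_messages_per_second(processors: int, seconds_per_message: float) -> float:
    """Upper bound on throughput when each message ties up a thread."""
    return processors / seconds_per_message


def backlog_after(seconds: float, incoming_per_second: float,
                  capacity_per_second: float) -> float:
    """Backlog accumulated when ingest exceeds processing capacity."""
    return max(0.0, (incoming_per_second - capacity_per_second) * seconds)


PROCESSORS = 16        # assumption: one processing thread per vCPU
HEALTHY_LOOKUP = 0.001 # hypothetical: 1 ms when the DNS server answers
TIMEOUT = 2.0          # hypothetical: 2 s wasted per lookup on a stale server
INCOMING = 500         # hypothetical ingest rate in msg/s

healthy = max_messages_per_second(PROCESSORS, HEALTHY_LOOKUP)
stalled = max_messages_per_second(PROCESSORS, TIMEOUT)
print(f"healthy capacity: {healthy:.0f} msg/s")
print(f"stalled capacity: {stalled:.0f} msg/s")
print(f"backlog after one hour: {backlog_after(3600, INCOMING, stalled):.0f} messages")
```

Under those assumptions, capacity drops from thousands of messages per second to single digits, so even a modest ingest rate piles up millions of messages within a few hours.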
Glad you found it. Yes, lookup adapters are powerful tools, but if you have them working on a lot of messages, their performance can be very important. Caching can often help with this, but caches often do not work well on IP-type lookups because there are so few repeats to get the benefits of caching.
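To illustrate the caching point, here is a small simulation of an LRU cache in front of a lookup. It is only a sketch: it assumes lookups are drawn uniformly at random from a pool of distinct IPs (real traffic is usually more skewed), and the pool sizes and cache size are arbitrary. It does not model Graylog's own cache implementation.

```python
import random
from collections import OrderedDict


def lru_hit_rate(distinct_ips: int, lookups: int, cache_size: int,
                 seed: int = 0) -> float:
    """Fraction of lookups served from an LRU cache of the given size."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    cache = OrderedDict()
    hits = 0
    for _ in range(lookups):
        ip = rng.randrange(distinct_ips)  # stand-in for an IP address
        if ip in cache:
            hits += 1
            cache.move_to_end(ip)  # mark as most recently used
        else:
            cache[ip] = None
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / lookups


# Few distinct IPs (e.g. internal hosts) versus many (e.g. internet traffic).
for pool in (100, 1_000_000):
    rate = lru_hit_rate(distinct_ips=pool, lookups=10_000, cache_size=1_000)
    print(f"{pool} distinct IPs -> hit rate {rate:.1%}")
```

With a small pool nearly every lookup is a cache hit, while with a very large pool almost every lookup still goes out to the adapter, which is why the speed of the underlying lookup matters so much.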