Updated to 6.3.1 and my Office365 Input (RAW TCP) no longer writes to index

Greetings, as stated above, I recently upgraded to Graylog 6.3.1. It has not been a smooth upgrade, and I have been ironing out a lot of little issues for the past week or so.

However, the one issue I can't seem to resolve is the Office365 Audit Log input. Ever since I upgraded, the input no longer writes to the index.

I have created new inputs and streams, added extractors, and removed extractors, and haven't had any luck. There are no immediately apparent errors in either the Graylog server logs or the OpenSearch cluster logs.

I do know Graylog receives the logs, but nothing is being written to the index.

Any thoughts?

Thank you!

Edit: Graylog Diagram

OpenSearch:

  • Manager 0
  • Manager 1
  • Manager 2
  • HotData 0
  • HotData 1
  • HotData 2
  • ColdData 0

Mongo:

  • mongodb 0
  • mongodb 1
  • mongodb 2

Graylog:

  • graylog server 0

What happens if you look at the input diagnosis screen for that input (it's in the "More options" drop-down next to the input)? Do you see anything under message errors, etc.?

No errors and, oddly enough, now this: a huge message dump (despite the remote data provider only running every 15 minutes).

How long ago did you start the input? Could this actually be a timestamp issue, where the messages are being ingested but show at an incorrect time?

The input has been running for months and months, but the issue only recently appeared. I've been thinking a timestamp issue would explain it, but I'm just not sure where the issue would arise. It's a Raw TCP input, and Graylog doesn't modify the timestamp as far as I know. At a minimum, the logs should be spread out across time even if the timestamps were wrong.

On a raw input the timestamps should be fine. Are all other logs okay? Have you tried turning off all processing of these messages and seeing what happens?

In recreating the stream, are you sure it is now writing to the M365 index?

Try recalculating the index range under maintenance.
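If it's easier to do from a shell, roughly the same thing can be done through the REST API. This is only a sketch: the hostname, credentials, and the default /api prefix below are assumptions, so adjust them to your setup.

# Rebuild the index ranges for all indices; Graylog requires the X-Requested-By header on POST requests.
curl -u admin:yourpassword -X POST -H "X-Requested-By: cli" \
  http://graylog.example.com:9000/api/system/indices/ranges/rebuild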

Now that you mention it, it looks like all my logs are 4 hours behind.

For instance, something that happens at 08:45 EST shows up in Graylog as 04:45.
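That 4-hour gap is suspicious because Eastern Daylight Time is UTC-4: if the local time in the message is being read as UTC somewhere in the chain and then converted back to Eastern time for display, it will land exactly 4 hours behind. A quick illustration with GNU date (the timestamp is just a placeholder):

# Interpret 08:45 as UTC, then render it in US Eastern time
TZ=America/New_York date -d "2025-06-10 08:45 UTC"
# -> 04:45 EDT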

Okay, I've dug deeper and have a HUGE problem: Graylog is using neither its input buffer nor its output buffer, but the process buffer is pegged at 100%. Further, I have approximately 4.7 million unprocessed messages. Graylog runs in an LXC, and I have maxed it out at 16 vCPUs and 32 GB of memory. Heap and garbage collection are temporarily set to:

GRAYLOG_SERVER_JAVA_OPTS="-Xms20g -Xmx24g -server -XX:+UseG1GC -XX:-OmitStackTraceInFastThrow -Djavax.net.ssl.trustStore=/etc/graylog/graylog.jks"

and my buffers are:

processbuffer_processors = 14
outputbuffer_processors = 8
processor_wait_strategy = blocking
ring_size = 262144

inputbuffer_ring_size = 262144
inputbuffer_processors = 6
inputbuffer_wait_strategy = blocking

Any suggestions?

How many messages per second are you ingesting, and how much processing are you doing? A machine of that size should easily process 10-20k messages per second, BUT some really intensive pipeline rules can kill that number very fast.
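If you want to put numbers on it, the metrics API can show the current rates and buffer fill without digging through the UI. This is a rough sketch; the metric names and the /api prefix are assumptions and may differ between versions:

# Incoming messages per second as seen by this node
curl -u admin:yourpassword http://graylog.example.com:9000/api/system/metrics/org.graylog2.throughput.input.1-sec-rate

# Process buffer utilization (messages currently sitting in the ring buffer)
curl -u admin:yourpassword http://graylog.example.com:9000/api/system/metrics/org.graylog2.buffers.process.usage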

So I disabled all pipelines, re-jiggered a PTR data adapter, and was able to catch up quickly (i.e., no backlog). So now I suppose the next step is to figure out which pipeline is causing so many issues for Graylog.

For anyone who comes here with a similar issue: I believe the solution for me was that my PTR lookup data adapter was pointed at some stale DNS servers. I had two of my largest streams - the Fortigate firewall and my M365 audit log - utilizing that adapter, and I believe the 4-million-message backlog was the result of each and every attempt to use the PTR data adapter timing out. It does not take long for an unwatched stream to back up.
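If you want to verify the same thing in your own environment, the quickest check is to run the reverse lookups the adapter would be doing directly against the configured DNS servers and watch how long they take. A small sketch (the DNS server and source IP below are placeholders):

# PTR lookup against a specific DNS server; a stale or unreachable server will hang until the timeout.
dig -x 198.51.100.7 @192.0.2.53 +time=2 +tries=1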

Glad you found it. Yes, lookup adapters are powerful tools, but if you have them working on a lot of messages, their performance can matter a great deal. Caching can often help with this, but caches often do not work well on IP-type lookups because there are so few repeats to get the benefit of caching.
