Since updating from Graylog 4.2.10 to 4.3.2, my Office 365 input keeps “stopping” at random times, several times a day. This is really annoying because I have some important alerts triggered by events on this input.
I have to restart the input manually: it does not actually stop, but no messages come in from it until it is restarted.
Is this a known problem?
My environment:
OS Information: Debian 11.0.15 on Linux 4.19.0-20-amd64
Package Version: Graylog 4.3.2+313b6bc
I have 2 inputs currently running: 1 Office 365 and 1 Syslog. I use one pipeline to check whether IP addresses are bad with the Threat Intel Plugin, and another to get geographic coordinates from the GeoIP lookup.
Has anyone already encountered this issue and solved it? If so, I would be happy to understand why it happens!
Thanks in advance for your answers,
Best regards,
G. Morin
I recall seeing this post that suggests that you increase your polling interval. I never like increasing polling intervals because I want to know RIGHT NOW… but it is at least something to experiment with.
Well, thanks, but even after deactivating all the pipeline rules and increasing the polling interval as you suggested, the input keeps stopping. This time, it stops after a few hours rather than after a few minutes.
Could it be because I upgraded straight from 4.2.10 to 4.3.2, skipping 4.3 and 4.3.1, without generating a new server.conf file?
No, nothing special on either side. The input is still running, but nothing seems to be ingested from it. I have to restart it manually for all the logs to be downloaded.
Yes, I have firewalls, but I made sure to allow this kind of traffic. What I don’t understand is: why does it “stop” unexpectedly like this?
If my firewall configuration were wrong, I don’t think I would get any logs from this input at all, would I?
I’m not entirely sure about this, but it seems logical to me.
A tough one when you don’t have any clues showing up in the logs. You had asked about the server.conf - I am not aware of any changes that would break an input… No logs on either side say anything of import, and it’s just that the Graylog Input stops working? Is it the Input, or is it that Microsoft stops sending? How can you tell one way or the other?
→ First of all, there is nothing in the logs on the Graylog side. When I go to the “System > Inputs” menu, the Office 365 input is neither in a disabled state nor in a failed state. It looks just as normal as the Syslog input listed right after it.
When I quickly deactivate and reactivate the input manually, the logs start downloading again as if everything were normal. Between these states, nothing changes on my side, from my MAC address to my public IP address, which is why, in my opinion, Microsoft is not blocking the log download.
I’m a newbie in the Graylog world, and I am not very familiar with the way Graylog requests the logs from Microsoft. Given the number of secrets and tokens generated through the Microsoft web UI, I think it’s a REST API, and I’m not familiar with those tools yet.
What is certain is that the Graylog server does manage to download some logs, which, for me, rules out a firewall issue.
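Since the input looks healthy in the UI while nothing is ingested, one way to pin down when it actually went quiet is to look for unusually long gaps between message timestamps. The snippet below is only a minimal sketch: `find_gaps`, the 30-minute threshold, and the sample timestamps are all hypothetical, and how you export the timestamps from Graylog is left open.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def find_gaps(
    timestamps: List[datetime], max_silence: timedelta
) -> List[Tuple[datetime, datetime]]:
    """Return (last_seen, next_seen) pairs where the silence exceeded max_silence."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_silence
    ]


# Hypothetical timestamps, for illustration only (not taken from a real export).
seen = [
    datetime(2022, 6, 1, 8, 0),
    datetime(2022, 6, 1, 8, 5),
    datetime(2022, 6, 1, 12, 30),  # first message after a manual restart
    datetime(2022, 6, 1, 12, 35),
]

for last_seen, next_seen in find_gaps(seen, timedelta(minutes=30)):
    print(f"silent from {last_seen} until {next_seen}")
```

The start of each gap is the time to look for in server.log, to see whether an exception was logged around the moment the input went silent.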
The Graylog logs, which can be watched with the command:
tail -f /var/log/graylog-server/server.log
should show the transition of the input from started to stopped and back. Can you post that portion and anything else that looks related?
Graylog listens for what is sent to it unless you have a plugin or a script that does otherwise. Do you have extractors running on that input? If so, post the details… a regex or GROK pattern could hang up the Input.
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
at java.lang.Thread.run(Thread.java:829) [?:?]
Caused by: java.io.EOFException: SSL peer shut down incorrectly
at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:483) ~[?:?]
at sun.security.ssl.SSLSocketInputRecord.readHeader(SSLSocketInputRecord.java:472) ~[?:?]
at sun.security.ssl.SSLSocketInputRecord.decode(SSLSocketInputRecord.java:160) ~[?:?]
at sun.security.ssl.SSLTransport.decode(SSLTransport.java:111) ~[?:?]
at sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1506) ~[?:?]
... 33 more
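When the full server.log is long, the root-cause lines of a trace like the one above can be pulled out with a few lines of Python. This is only a sketch: the helper name `caused_by_lines` is made up for illustration, and the sample lines are copied from the trace posted in this thread.

```python
from typing import Iterable, List


def caused_by_lines(lines: Iterable[str]) -> List[str]:
    """Return the 'Caused by:' lines of Java stack traces, with whitespace stripped."""
    return [line.strip() for line in lines if line.lstrip().startswith("Caused by:")]


# Sample lines copied from the trace posted in this thread.
sample = [
    "at java.lang.Thread.run(Thread.java:829) [?:?]",
    "Caused by: java.io.EOFException: SSL peer shut down incorrectly",
    "at sun.security.ssl.SSLSocketInputRecord.read(SSLSocketInputRecord.java:483) ~[?:?]",
]

for line in caused_by_lines(sample):
    print(line)
```

To run it against the real file, pass an open file handle for /var/log/graylog-server/server.log instead of `sample`.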
That’s what I understood from reading the documentation, but is it the same with cloud-based apps like Office 365? Is there no polling of Azure triggered every X seconds?
I have this extractor running on the input, composed of several grok patterns:
I also have this single pipeline that uses the GeoIP plugin. I reused the one given in this tutorial; here is the code:
rule "GeoIP lookup: source_ip"
when
  has_field("source_ip")
then
  let geo = lookup("geoip", to_string($message.source_ip));
  set_field("source_ip_geo_location", geo["coordinates"]);
  set_field("source_ip_geo_country", geo["country"].iso_code);
  set_field("source_ip_geo_city", geo["city"].names.en);
end
Don’t know if this has anything to do with it but your GROK statement looks as though it is missing some escape characters - any quotes should be escaped \" or if you are in a pipeline they need to be double escaped \\" … the colon as well
Graylog’s list of characters that need to be escaped:
& | : \ / + - ! ( ) { } [ ] ^ " ~ * ?
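For testing an “escaped” version of a pattern, the escaping can be done mechanically rather than by hand. The sketch below backslash-escapes every character from the list above. `escape_literal` is a hypothetical helper, and it assumes you only feed it the literal text fragments of a pattern: running it over Grok syntax such as `%{WORD:user}` would also escape the braces and the colon that Grok itself relies on. Whether every character in the list really needs escaping in an extractor is not confirmed in this thread.

```python
# Characters copied from the list above.
SPECIAL = set('&|:\\/+-!(){}[]^"~*?')


def escape_literal(text: str) -> str:
    """Prefix each listed special character with a single backslash."""
    return "".join("\\" + ch if ch in SPECIAL else ch for ch in text)


# A made-up literal fragment, for illustration only.
print(escape_literal('user:"admin"'))  # prints: user\:\"admin\"
```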
Also of note: it is good form to use ^ at the start of a GROK/regex search, and sometimes even $ at the end, to make sure that GROK/regex isn’t sliding its search around trying to fit your pattern in wherever possible… that could slow things down more than needed, particularly at high volume.
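The effect of anchoring is easy to see in a small test. The sketch below uses Python’s `re` module purely for illustration (Graylog runs on Java, so this is not its regex engine), and the sample line is made up. Without ^ the engine slides along the line until it finds a fit; with ^ it only tries the start of the line and gives up immediately otherwise.

```python
import re

# Made-up sample line, for illustration only.
line = "user=bob id=77"

unanchored = re.compile(r"id=(\d+)")
anchored = re.compile(r"^id=(\d+)")

# The unanchored pattern scans forward and matches in the middle of the line.
print(unanchored.search(line).group(1))  # prints: 77

# The anchored pattern is only tried at position 0, so there is no match.
print(anchored.search(line))  # prints: None
```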
I just tried running an “escaped” version against one log in the indexes as a test, but it refuses to run even though every character that needs escaping is escaped. Could you give me an example for the extractor, please?
I don’t have an example message to build/check from… You can plug it into an online GROK debugger and see what the results are. The linked one has worked well for me…
I didn’t see this before - I think I was on the train when I was reviewing… That says to me that Office 365 stopped sending in a way that pissed off your Input, and it is possibly related to SSL. I would start hunting on the Office 365 side…
Well, I have activated support for TLS 1.3 and removed SSLv3 and TLS 1.0 & 1.1, but it’s no better.
I’ve seen nothing on the Office 365 side; the parameters are very limited in the UI.
I don’t know how to proceed. It’s weird because it worked flawlessly a month ago, and I don’t really know why it crashes randomly like that. It’s so frustrating!
3. The logs you showed above are not the full message. Could you show the full log file covering the crash of this input and its startup? That would help in troubleshooting this issue.
4. What is the configuration of this input? If you’re using the default, the built-in Office 365 input should look something like this.