The other day I made the following upgrades to our single-node setup, using the apt repositories and keeping all the old config files:
Graylog 3.3 → 4.0.5
Elasticsearch 6.8.14 → 7.10.2
The node was originally installed from the OVA appliance download, and this is the first time it has been updated.
After the upgrade the node seems to be working fine, but I noticed that when I stop the elasticsearch service, messages no longer go to the output buffer and wait there, as they did before the update.
Now they only count up in the journal, and as far as I can see, all messages from the time Elasticsearch was stopped until it was started again are lost.
I also have a single node that was installed by hand, running at home on the same versions. This one also doesn't buffer messages when Elasticsearch is stopped, and messages from that timeframe are subsequently lost whenever Elasticsearch is restarted or down.
Also, in htop the VIRT size of Elasticsearch has changed from about the size of all indices combined to around half that. I don't know if this is anything to worry about, I just noticed it. What could be the reason for this?
The CPU usage of the machine has also been cut in half compared to before the upgrade, which is nice, while the number of messages being processed is still the same.
The journal holds messages before they can be written to Elasticsearch.
I stopped my ES and the journal filled up. This is expected. As for your buffers, these are utilized for Elasticsearch. Normally, when you see the buffers full, I would check your GL configuration file and the Elasticsearch logs and service.
Could you confirm that you can find and read the log messages that were written while Elasticsearch was stopped?
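One way to check this without going through the Graylog UI is to ask Elasticsearch directly how many documents it holds for the outage window. The following is only a rough sketch: it assumes the default `graylog_*` index naming, the `timestamp` field that Graylog writes, and an unauthenticated Elasticsearch on `http://localhost:9200`. The helper names and the example dates are hypothetical placeholders.

```python
import json
import urllib.request


def build_count_query(start_iso, end_iso):
    """Build a range query on the 'timestamp' field for the given window."""
    return {
        "query": {
            "range": {
                "timestamp": {
                    "gte": start_iso,
                    "lt": end_iso,
                    # Tell Elasticsearch to parse the bounds as ISO-8601.
                    "format": "strict_date_optional_time",
                }
            }
        }
    }


def count_messages(es_url, index_pattern, start_iso, end_iso):
    """Return the number of documents in the window via the _count API."""
    body = json.dumps(build_count_query(start_iso, end_iso)).encode("utf-8")
    req = urllib.request.Request(
        f"{es_url}/{index_pattern}/_count",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["count"]


# Example call (placeholder date, times in UTC; adjust to your own outage window):
# count_messages("http://localhost:9200", "graylog_*",
#                "2021-02-10T08:03:00Z", "2021-02-10T08:05:00Z")
```

A count of 0 for the window while Elasticsearch was stopped, with normal counts on either side, would back up the impression that those messages never got indexed.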
I have gotten a Graylog 3.3 instance running from backup; the two pictures show what I mean.
The pictures of the Nodes tab were taken while Elasticsearch was stopped, and the pictures that show the message stats were taken right after Elasticsearch was started again.
White theme is Graylog 3.3, and here Elasticsearch was stopped between 8:51 and 8:53.
Dark theme is Graylog 4.0.5, and here Elasticsearch was stopped between 9:03 and 9:05.
As you can see, outgoing drops to 0 with version 3.3, which makes sense, as Elasticsearch obviously can't accept messages while it is stopped, and the buffer starts filling up. With 4.0.5, however, it looks like messages are being output to something, but not to Elasticsearch, as it is not running when the picture is taken…
I waited a couple of minutes for the journal to fill up; the output buffer was at 0%. Then I started ES again.
The output buffer went to 14%. I checked the time when ES was stopped and matched it against the timestamps of the messages, which indicates that messages from other servers did reach the Graylog server during the outage.
Is your Elasticsearch healthy and able to cope with the message throughput?
Did you see anything in Graylog/Elasticsearch log files?
I found that an output buffer filling up is indicative of either Elasticsearch issues or insufficient output processors.
Normally, when ES is too slow, the output buffer percentage starts increasing. I have stopped ES many times in the past but have never seen the output buffer fill up while it was stopped, or I just never noticed it. My ES always looks like the picture from my first post here when I stop the ES service.
@gsmith So you have messages in the message counter under the Search tab from the timeframe when Elasticsearch was stopped, and once ES is started again you can read those messages that were received by Graylog while ES was down?
I don't think this is a performance issue with Elasticsearch, as it easily handles the normal daily load of around 250-500 messages/second, and all buffers are always at 0%.
In version 3.3, if ES had been stopped and started again after letting the buffer fill up for a bit, I would sometimes see 5000-7500 messages/second out, and the output buffer would clear very quickly, which I would say indicates that Elasticsearch is performing well.
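As a back-of-the-envelope check of those numbers, the backlog from a short outage and the time needed to drain it can be estimated as below. The rates come from this thread (500 messages/second in, 5000 messages/second out, roughly two minutes of downtime); the function names are made up for illustration.

```python
def backlog_after_outage(ingest_rate, outage_seconds):
    """Messages that pile up while nothing can be written to Elasticsearch."""
    return ingest_rate * outage_seconds


def drain_seconds(backlog, ingest_rate, drain_rate):
    """Time to clear the backlog while new messages keep arriving."""
    net_rate = drain_rate - ingest_rate
    if net_rate <= 0:
        # Output cannot keep up with input, so the backlog never clears.
        return float("inf")
    return backlog / net_rate


backlog = backlog_after_outage(ingest_rate=500, outage_seconds=120)
print(backlog)                            # 60000 messages queued
print(drain_seconds(backlog, 500, 5000))  # about 13.3 seconds to catch up
```

So a two-minute stop at the upper end of the normal load should leave about 60000 messages waiting, which at the observed burst rate clears in well under a minute.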
What I need to find out is what changed with the update to 4.0.5 that causes messages to be lost while Elasticsearch is stopped or restarted, which didn't happen in 3.3 before the update. Nothing has changed in my configuration, so something must have changed with the update.
Could you also explain what in/out means? I would guess that out means the messages are written to Elasticsearch, but that should not be possible if the service is not running… That matches what I saw with the old Graylog 3.3 instance, where out would go to 0 if ES was not running… but I might be wrong.
As for the GL/ES log files, I have no idea what to look for; nothing comes up there that I would deem the cause of my issue…
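A simple starting point for the logs is to pull out only the WARN and ERROR lines that fall inside the window when Elasticsearch was stopped. This is a minimal sketch and rests on two assumptions: that each log line starts with an ISO-8601 timestamp followed by the log level, and that the server log lives at `/var/log/graylog-server/server.log` as on a package install. The function name is hypothetical.

```python
from datetime import datetime


def suspicious_lines(lines, start, end, levels=("WARN", "ERROR")):
    """Return log lines at the given levels whose timestamp is in [start, end].

    Assumes the line layout '<ISO-8601 timestamp> <LEVEL> <rest>'.
    Lines that do not match this layout are skipped.
    """
    hits = []
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2 or parts[1] not in levels:
            continue
        try:
            ts = datetime.fromisoformat(parts[0])
            in_window = start <= ts <= end
        except (ValueError, TypeError):
            # Unparseable timestamp, or naive/aware datetime mismatch.
            continue
        if in_window:
            hits.append(line.rstrip("\n"))
    return hits


# Example use (path and window are assumptions; start/end must carry the
# same UTC offset style as the timestamps in the log):
# with open("/var/log/graylog-server/server.log") as fh:
#     for hit in suspicious_lines(fh, start, end):
#         print(hit)
```

Comparing the hits from a 3.3 instance and a 4.0.5 instance for the same kind of outage might show whether the newer version logs failed indexing attempts instead of holding the messages back.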
I'm sorry to see you're having trouble upgrading Graylog, but at this point, without knowing fully how you installed your Graylog environment and its configuration, it's hard to troubleshoot your issue. I see you're using the OVA; I haven't used that type of installation before, so I'm unsure.
Yes, in/out are messages coming in and being processed.
To be honest, I haven't had or seen any problems with my Graylog server losing messages. Maybe it's just the way I have my Graylog server configured. I have CentOS 7, an all-in-one server on a VM. I push 30 GB of data a day using TCP/TLS. This server has been updated continuously since version 1. I have added more resources and storage in the past, but that is it. So I would assume that if you're losing messages, it is some type of configuration issue or something overlooked?
The behaviour seems to change between graylog-3.3.9-1.ova (updated via apt to 3.3.11) and graylog-4.0.0-5.rc.1.ova. I cannot find any differences in the GL/ES config files between these two that would affect the journaling/buffering behaviour… The only difference I have spotted is that 3.3.11 doesn't have the ES support plugins installed, while 4.0.0 does. I tried removing them from 4.0.0, but then Graylog doesn't want to start… so no luck there. The changelog doesn't provide any info about changes that could have anything to do with this either. Could something have been introduced by accident between the two versions that breaks the buffering?
This might also just be a problem on Ubuntu/Debian, as you, @gsmith, don't have the problem on CentOS.
Also, from reading around I understand that the message processing goes something like this:
input → input buffer → journal → process buffer → output buffer → ES
Following this, the behaviour of version 3.3.11 and earlier makes sense: when ES is stopped, first the journal and output buffer fill up, then the journal and process buffer, and then just the journal, as the messages now have nowhere else to go. Keep in mind that they are all still present in the journal, as the messages have not yet been written to any other persistent storage.
Not sure if I'm completely right about this, though…
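To make that fill-up order concrete, here is a toy model of the chain above. It is not how Graylog is implemented: the input buffer is left out, the buffer sizes are arbitrary small numbers, and each message is counted in exactly one stage, whereas (as noted above) the real journal still holds messages that are sitting in the buffers. It only illustrates the back-pressure idea.

```python
def simulate(ticks, rate, es_up, process_cap=4, output_cap=4):
    """Toy back-pressure model of journal -> process buffer -> output buffer -> ES.

    rate   : messages arriving per tick
    es_up  : function tick -> bool, whether Elasticsearch accepts writes
    """
    journal = process = output = indexed = 0
    for t in range(ticks):
        # Work from the back of the pipeline so freed space is reused at once.
        if es_up(t):
            indexed += output
            output = 0
        moved = min(process, output_cap - output)
        process -= moved
        output += moved
        moved = min(journal, process_cap - process)
        journal -= moved
        process += moved
        # New messages always land in the (unbounded) journal first.
        journal += rate
    return {"journal": journal, "process": process,
            "output": output, "indexed": indexed}


# Elasticsearch stopped for the whole run:
print(simulate(ticks=10, rate=2, es_up=lambda t: False))
# {'journal': 12, 'process': 4, 'output': 4, 'indexed': 0}
```

In this model both buffers end up full and everything else waits in the journal, with nothing lost: the four counters always add up to the number of messages received. That is the behaviour described for 3.3.11.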
With 4.0.0 and higher, the messages do start counting up in the journal, but no matter how many messages are there, they all vanish at once when ES is started again, and they don't seem to be processed, as is clear from the message counter, which sits at 0 for that timeframe.