We have a single node with 8 cores and 32 GB that processes 1-2k messages per second on average. The node handles the load really well, but then something happens and it just stops processing messages. There's nothing in any of the logs and no other indication as to why this happens.
In some cases the journal seems to get corrupted. Graylog won’t process any messages for a few hours and only starts again when I remove the journal. Has anyone come across this before? Are there any known fixes for this?
The node is hosted in Azure, and we had to stop the machine and resize its data disk, which held the journal. Could the resizing have corrupted it?
Is it just the architecture? (We’re looking to cluster but are still trying to work out load balancing.)
Any other ideas, anyone?
Regards,
G