We have an old Graylog 3.3.15 setup running on AWS ECS/EC2 with an AWS OpenSearch (ES 5.6) cluster behind it (yes, I know, very old versions; we plan to upgrade this year, but I'd like to fix the current issue first).
About 2 weeks ago the ES cluster went into read-only mode because it ran out of storage; until we noticed and fixed it, the cluster reported status green but was not accepting any new messages. After space was freed it started indexing again, but it only caught up to about current time minus 4 hours and has stayed there for more than a week now.
Notes:
There was no change in how messages are being sent, and timestamps look correct
I've recalculated index ranges - it changed nothing
CPU/mem on containers and ES cluster seem to be pretty low, so it should be able to catch up to current time
Graylog is constantly showing two notifications: "Uncommitted messages deleted from journal" and "Journal utilization is too high" - I suspect it is working through the journal at roughly the same rate that new messages arrive, so it never catches up. I'm OK with emptying the journal, even if it means losing those messages (rough sketch below).
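For reference, this is roughly what emptying the journal would look like (a sketch only - the path is the default from a package install's message_journal_dir; in our ECS containers the journal sits on a mounted volume, so the actual path and service commands may differ):

# Stop Graylog so it releases the journal files
sudo systemctl stop graylog-server
# Remove the on-disk journal (adjust the path to your message_journal_dir)
sudo rm -rf /var/lib/graylog-server/journal/*
# Start Graylog again; it recreates an empty journal on startup
sudo systemctl start graylog-server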
I had the same thing happen. When I deleted everything in the journal and the buffers went down, I noticed it was still showing minus hours (lagging behind current time). I did a Graylog service restart, then manually rotated my index set(s), and that seemed to clear it up for me. If you're using GROK or REGEX extractors, that may have caused the issue. A rough sketch of doing the rotation via the API is below.
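In case it's useful, rotation can also be triggered against the REST API instead of the UI - this is only a sketch: the host, credentials and index set ID are placeholders, and you should double-check the endpoint in the API browser for your Graylog version:

# Cycle the active write index of one index set (ID comes from System -> Indices)
curl -u admin:password \
     -H 'X-Requested-By: cli' \
     -X POST \
     http://graylog.example.com:9000/api/system/deflector/<INDEX_SET_ID>/cycle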
EDIT: A suggestion from when my ES/OS ran out of room and went into read-only mode: I decided to put my data on a separate volume. First, it was making my service sluggish, and second, when I realized I needed to expand the volume, it was kind of tedious.
Example:
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /mnt/dev/sdc1/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
It does make it easier to expand the drive.
If that drive runs out of room, my services are not affected. Not sure if you do that, though. It is also easier to set up from the beginning. Same goes for my journal. A rough sketch of expanding such a volume is below.
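For completeness, expanding a separate EBS data volume looks roughly like this (a sketch - the volume ID, device name and filesystem type are assumptions; adjust them to your setup):

# 1. Grow the EBS volume itself (console or CLI); the volume ID is a placeholder
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 --size 200
# 2. On the instance, grow the partition (skip if the filesystem sits directly on the device)
sudo growpart /dev/nvme1n1 1
# 3. Grow the filesystem to use the new space
sudo resize2fs /dev/nvme1n1p1        # ext4
# sudo xfs_growfs /mnt/dev/sdc1      # xfs, mounted at the path.data location above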
I have increased my journal size to 12 GB; not sure if that's an option for you, but when the journal fills up, that is a good indicator that ES/OS is either not indexing fast enough (i.e., it needs more resources, such as CPU) or there is a bad log/message that is preventing ES/OS from indexing.
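The journal size and location are set in Graylog's server.conf - the path below is the package default, so in a container it may instead come from an environment variable or a mounted config:

message_journal_enabled = true
# Directory the journal is stored in (default for package installs)
message_journal_dir = /var/lib/graylog-server/journal
# Maximum size the journal may grow to on disk
message_journal_max_size = 12gb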
Side note: we started using this.
It works very well on both OpenSearch and Elasticsearch.
I have it connected to both ES & OS; when you log on, you choose which cluster you want.