Hi.
On our production stand we have 2 VMs (4 CPUs, 12 GB RAM each).
Graylog + Elasticsearch + MongoDB run on each node, all in Docker.
All settings are set to default except:
ES_JAVA_OPTS: -Xms4g -Xmx4g (for Elasticsearch)
GRAYLOG_OUTPUT_BATCH_SIZE: 1000 and GRAYLOG_OUTPUTBUFFER_PROCESSORS: 4 (for Graylog).
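For reference, this is roughly how those settings sit in our docker-compose file. A minimal sketch: the image tags and the omitted volume/network/port settings are assumptions, only the environment variables above come from the real setup.

    version: "3"
    services:
      mongodb:
        image: mongo:4.2                          # version is an assumption
      elasticsearch:
        image: docker.elastic.co/elasticsearch/elasticsearch:6.8.23   # version is an assumption
        environment:
          ES_JAVA_OPTS: "-Xms4g -Xmx4g"           # Elasticsearch heap, as noted above
      graylog:
        image: graylog/graylog:3.3                # version is an assumption
        environment:
          GRAYLOG_OUTPUT_BATCH_SIZE: "1000"       # max messages per bulk request to Elasticsearch
          GRAYLOG_OUTPUTBUFFER_PROCESSORS: "4"    # output buffer processor threads
        depends_on:
          - mongodb
          - elasticsearch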
NTP is working correctly on all nodes.
Elasticsearch cluster health is green.
Pipelines and extractors are not set up.
In Graylog we have a problem with too many uncommitted messages:
We have a similar problem, but on a physical Dell server with 96 GB of memory and 24 processors.
I’ve been all over the tuning guidelines to do the best I can with it, but have found that it works normally for a few days, then the message queue starts to back up, and eventually we lose messages.
Elasticsearch memory is not swappable and its heap is set to 24 GB; the system is not paging, but one telling sign is that both the Elasticsearch and Graylog CPU usage spike up.
What one of my database team members suggested, and it seems to work, is restarting the processes to clean up the state of the JVM - I think the garbage collector is starting to thrash.
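If you want to confirm the GC theory before falling back on restarts, Elasticsearch exposes JVM heap and garbage-collection statistics over its REST API. A quick sketch, assuming the node answers on localhost:9200:

    # Heap usage plus cumulative GC collection counts and times per node
    curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty'

    # Check that memory locking really is in effect (should report "mlockall": true)
    curl -s 'http://localhost:9200/_nodes?filter_path=**.mlockall&pretty'

If the old-generation collection time keeps climbing while heap usage stays pinned near the limit, that matches the thrashing you describe.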
Our system processes on average 80,000 messages per minute (roughly 1,300 per second) and stores around 50 GB of data per day.
For forward planning, I’m assuming a second Graylog node would be more helpful than separating Graylog from ES. Is that correct?
No, running Graylog and Elasticsearch on separate machines is more important. Otherwise they’ll compete for the same resources (CPU, memory, I/O bandwidth and disk cache), which leads to cache thrashing.
After increasing the resources on each VM (4 CPUs -> 6 CPUs, 14 GB RAM -> 16 GB RAM) and giving Elasticsearch more heap (ES_JAVA_OPTS: -Xms4g -Xmx4g -> -Xms6g -Xmx6g), everything became fine.
As a reference point for other posters: a cluster of 2 VMs (Elasticsearch + Graylog + MongoDB on each) processes about 100-120 GB of logs per day and 3,000-5,000 messages per second during peak hours.
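For completeness, the only compose change besides the VM resize was the Elasticsearch heap (same sketch assumptions as above):

      elasticsearch:
        environment:
          ES_JAVA_OPTS: "-Xms6g -Xmx6g"   # raised from 4g after the VMs went to 16 GB RAM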