We recently installed Graylog, and I cannot for the life of me get the logs into Elasticsearch quickly enough, so our unprocessed message count keeps increasing.
We have a CentOS 7 server with 8 CPUs and 64GB of memory: 20GB is assigned to the Graylog heap and 20GB to the Elasticsearch heap.
My server.conf file had these settings:
processbuffer_processors = 5
outputbuffer_processors = 3
inputbuffer_processors = 2
I found another post that suggested tinkering with these, but changing the values (and restarting the service) hasn’t helped.
I’m quite new to Graylog so any support would be great (and please dumb stuff down for me as much as you can to get me back onto the right track).
I guess your Elasticsearch and your Graylog are fighting for CPU, which is why it can’t get up to speed.
You would need to lower your ingest volume, or add another server with the same specs dedicated to Elasticsearch, or add a slightly smaller one dedicated to Graylog.
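To make the CPU contention concrete: the three buffer settings quoted above add up to 10 processor threads on an 8-CPU machine, before Elasticsearch gets any share. Below is a rough Python sketch of that arithmetic. The function name and the `reserved_for_es` parameter are made up for illustration, and "threads should not exceed available cores" is only a common rule of thumb, not a hard Graylog limit.

```python
def processor_budget(cpu_cores, processbuffer, outputbuffer, inputbuffer,
                     reserved_for_es=0):
    """Compare Graylog's configured buffer processors with the cores available.

    reserved_for_es is a hypothetical knob: how many cores you want to leave
    free for Elasticsearch when both run on the same host.
    """
    graylog_threads = processbuffer + outputbuffer + inputbuffer
    cores_available = cpu_cores - reserved_for_es
    return {
        "graylog_threads": graylog_threads,
        "cores_available": cores_available,
        "oversubscribed": graylog_threads > cores_available,
    }


# Values taken from the settings posted in this thread (8 CPUs, 5/3/2).
budget = processor_budget(cpu_cores=8, processbuffer=5,
                          outputbuffer=3, inputbuffer=2)
print(budget)
```

With the posted values the budget comes out oversubscribed even with nothing reserved for Elasticsearch, which fits the "fighting for CPU" picture.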
Thanks for the advice, Jan. Are there any other tweaks I can make?
I ran top and noticed that the Elasticsearch java process wants 233g of virtual memory. Is that normal, or would it be solved with more vCPUs or by splitting Graylog and Elasticsearch onto separate servers?
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3193 graylog 20 0 26.4g 20.6g 9872 S 351.5 32.9 2533:28 java
3189 elastic+ 20 0 233.7g 20.5g 144096 S 255.8 32.7 1633:08 java
3315 mysql 20 0 21.3g 717108 7372 S 16.6 1.1 52:15.20 mysqld
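For reading this output, a large VIRT value on the Elasticsearch process is typically address space from memory-mapped index files rather than RAM in use, so RES and %CPU are usually the more telling columns. Here is a minimal Python sketch that totals the %CPU column from the rows above and compares it with the 8 CPUs on this host. It assumes top's default mode, where 100% equals one core. The helper name is hypothetical.

```python
TOP_OUTPUT = """\
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3193 graylog 20 0 26.4g 20.6g 9872 S 351.5 32.9 2533:28 java
3189 elastic+ 20 0 233.7g 20.5g 144096 S 255.8 32.7 1633:08 java
3315 mysql 20 0 21.3g 717108 7372 S 16.6 1.1 52:15.20 mysqld
"""


def total_cpu_percent(top_text):
    """Sum the %CPU column of whitespace-separated top rows."""
    lines = top_text.strip().splitlines()
    cpu_index = lines[0].split().index("%CPU")  # locate the column by header
    return sum(float(row.split()[cpu_index]) for row in lines[1:])


cores = 8  # from the server description in this thread
used = total_cpu_percent(TOP_OUTPUT)
share = used / (cores * 100)  # assumes 100% == one fully busy core
print(f"{used:.1f}% of {cores * 100}% ({share:.0%} of all cores)")
```

These three processes alone account for roughly three quarters of the machine, with Graylog and Elasticsearch taking nearly all of that.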
Also check your disk subsystem and whether it has enough I/O capacity for that volume of data to be written. I see mysql running on the same server; that can consume a lot of I/O if it is under heavy load. How powerful is your disk subsystem? How many disks and how much RAID cache does it have, and do you use local disk, SAN, or NAS?
If you use a lot of generic extraction rules, try moving them to pipeline rules, and separate log types by pipeline, with rules for each specific application/program/device, to lower Graylog’s CPU usage.
Thank you, our disks are hosted on a SAN, and I think we get good performance out of the disk array but I will look into it. I must admit it is shared along with other stuff, but I’ll get some perf tests for it.
I think we use Windows and Cisco extraction tools, but I’ll look at tuning that. I think we’re getting bombarded with Windows events too, so I need to tune those down until I’m getting just what I need for now.