We have a 3-node Graylog cluster that I upgraded from 2.2.3 to 2.3.1. After the upgrade the web interface was noticeably slower, especially on the Search page and the Sources page.
The last time I loaded the Search page it took 12 seconds, even though the search result itself says "Found 39,207 messages in 342 ms, searched in 231 indices" for the last 5 minutes. The Sources page takes roughly 6 seconds to load, and its two graphs show the spinning icon until it finishes. Both of these pages loaded almost instantaneously before the upgrade.
I thought it might be the HAProxy load balancer that points to nginx, but I took both out of the equation and the speeds remained the same.
I then upgraded Elasticsearch from 2 to 5. Still the same result.
Our setup is as follows:
3 Graylog VMs running Graylog and MongoDB
20 CPUs
14 GB of RAM
CentOS 7
Graylog 2.3.1+9f2c6ef
Linux 3.10.0-693.2.2.el7.x86_64
Oracle Corporation 1.8.0_144
openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
The index sets are configured with
Shards: 4
Replicas: 2
Each Elasticsearch node holds roughly 250 GB of data.
The hardware behind this setup is brand new and not being taxed at all; the SAN IOPS are hardly being touched. This setup is only receiving roughly 100 messages/sec.
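For context on the index count mentioned above: 231 indices at 4 shards and 2 replicas each works out to a large shard footprint for 3 data nodes (231 × 4 primaries × 3 copies ≈ 2,772 shards in total). A quick way to confirm the actual totals, assuming Elasticsearch is listening on localhost:9200, is:

# count every shard the cluster is tracking
curl -s 'localhost:9200/_cat/shards' | wc -l

# or read the summary straight from cluster health
curl -s 'localhost:9200/_cluster/health?pretty' | grep -E 'active_primary_shards|active_shards'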
Here is a link to our config for the master Graylog node. The other 2 are identical except where node1 needs to be node2, etc., and only node1 is master.
One thing I've noticed is that when I load one of those pages and watch top on the command line, I see mongod jump to 50%+ until the page loads. Not sure if that's normal.
I checked the Elasticsearch heap and it looks fine to me.
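For reference, per-node heap usage can be pulled from the cat API if anyone wants to compare numbers (again assuming Elasticsearch on localhost:9200):

# heap usage per node: percent used, current, and max
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.current,heap.max'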
Completely cloned our environment to Dev and isolated it on its own closed network.
The first thing I tried was setting all the index sets to retain only one index and rotating them, clearing out almost all data except what was still feeding in from the closed environment. Didn't see much difference in performance.
Rolled back the changes and tried downgrading MongoDB. Didn't notice any substantial difference. I also didn't see mongod spiking on page load, but that might be because the dev system is relatively idle.
Disabled TLS, nginx, and HAProxy and loaded the page straight from the Graylog HTTP interface. No difference.
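One way to take the browser out of the picture as well is to time a cheap REST call directly against the Graylog API; a sketch, assuming the API listens on localhost:9000/api, admin:password is a placeholder for real credentials, and the /count/total resource from the 2.x API:

# time connect, first byte, and total for a simple API call
curl -s -o /dev/null -u admin:password \
  -w 'connect: %{time_connect}s  first byte: %{time_starttransfer}s  total: %{time_total}s\n' \
  'http://localhost:9000/api/count/total'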
Tried reinstalling the Graylog RPM, no change.
I'm not seeing anything in the logs and the buffers are always empty, so I'm not sure what else to look for.
It seems like the graph processing on the Search and Sources pages may be the culprit, but I could be off base.
Hoping @Jan, @jochen, or one of the other awesome people here has some insight on what to do next.
Edit: I believe it's something to do with the calls being made on those pages, possibly related to the 4096-byte regression in the prior 2.3.0 release and how it was fixed. I tried rolling back to that release, but it had the 4096 errors. I've noticed that on the current release, when I search within a stream everything loads quickly, including the graphs.
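If it helps to quantify that difference, the same relative search can be timed globally and restricted to one stream via the REST API; a rough sketch, where admin:password and STREAMID are placeholders and the filter syntax is assumed from the 2.x search API:

# global search over the last 5 minutes
curl -s -o /dev/null -u admin:password -w 'global: %{time_total}s\n' \
  'http://localhost:9000/api/search/universal/relative?query=*&range=300&limit=1'

# same query limited to a single stream
curl -s -o /dev/null -u admin:password -w 'stream: %{time_total}s\n' \
  'http://localhost:9000/api/search/universal/relative?query=*&range=300&limit=1&filter=streams:STREAMID'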
I checked my lab and the following versions of OpenJDK are installed, but I did not see any errors. I did notice that the interface feels slower from time to time. I will check whether this depends on which host in the cluster the LB connects me to, and whether that impression differs between the 3 servers.
Thanks for bringing this up, we will investigate.
Ubuntu 16.04 LTS
java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
Debian 8
openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2~bpo8+1-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)
CentOS 7
openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)
@jan Upgraded to the latest version today, still no change. I did notice that the Search page and Sources page take 15 seconds to load, but in a stream I can run a search for the past 30 days and get the following with a 3-second page load.
Found 107,573,655 messages in 591 ms, searched in 23 indices.
Results retrieved at 2017-10-19 17:30:11.
I'm wondering if it has something to do with my streams. We have a little under 50 streams, each with its own index set.
Edit: Going to try changing my indices that had been set to 30-day rotation to 1-day and see if that makes any difference.
2017-10-20T12:55:16.462-04:00 ERROR [UsageStatsClusterPeriodical] Uncaught exception in periodical
org.graylog2.indexer.ElasticsearchException: Fetching message count failed for indices [2n-devices_1, 2n-devices_8, 2n-devices_1
…
An HTTP line is larger than 4096 bytes.
I tried setting http.max_initial_line_length: 64k in /etc/sysconfig/elasticsearch, but that doesn't appear to have worked, as it's probably not the right place to set it. I then tried setting it in /etc/elasticsearch/elasticsearch.yml, but then Elasticsearch doesn't want to start.
Does anyone know how to set this on a CentOS/RHEL system? Running Elasticsearch 5.6.
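For what it's worth, http.max_initial_line_length is a static node setting, so /etc/elasticsearch/elasticsearch.yml is the right place; /etc/sysconfig/elasticsearch only holds environment variables for the service. A minimal sketch of the change, assuming a stock RPM install of Elasticsearch 5.6 (a missing space after the colon is a common reason a YAML edit stops the node from starting):

# /etc/elasticsearch/elasticsearch.yml -- note the space after the colon
http.max_initial_line_length: 64kb

# restart the node and confirm the setting was picked up
sudo systemctl restart elasticsearch
curl -s 'localhost:9200/_nodes/settings?pretty' | grep max_initial_line_length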
Thanks Jochen. Do you know if this might explain the slowness I'm seeing, as mentioned above?
It's weird because searching within a stream works quickly, as expected, but from the Search tab it's 15 seconds to load. I've been wondering if this is related to the thread you mentioned, which I've been following.
I had used the dev tools in Chrome and Firefox a while back but didn't see anything standing out. The one thing that appeared to be the slowest was the bar chart: when I loaded the page I would see the search results show up in about 4 seconds, while the bar chart might take 12 seconds to finish.
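To check whether the histogram request itself, rather than the rendering, is the slow part, the call behind the bar chart can be timed directly; a sketch, assuming the 2.x search API at localhost:9000/api and placeholder credentials:

# time the histogram for the last 5 minutes at one-minute resolution
curl -s -o /dev/null -u admin:password -w 'histogram total: %{time_total}s\n' \
  'http://localhost:9000/api/search/universal/relative/histogram?query=*&range=300&interval=minute'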
This seems to be resolved now. I ended up basing our index rotation on size and greatly reduced the total number of indices we had. This was something I had been meaning to do regardless of this issue, to avoid something like a DoS filling up our disks.
Reducing the total number of indices seems to have made the biggest impact, and things are working at regular speeds now. It is strange that we didn't have this issue before the upgrade, which makes me believe it has to do with how the calls are being made over HTTP to Elasticsearch.
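For anyone comparing before and after, the index footprint is easy to snapshot from the cat API, and the rotation settings per index set can be read back over the Graylog REST API (the index_sets path is assumed from the 2.3 API browser, credentials are placeholders):

# how many indices the cluster is carrying now
curl -s 'localhost:9200/_cat/indices' | wc -l

# rotation and retention settings for every index set
curl -s -u admin:password 'http://localhost:9000/api/system/indices/index_sets'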