Graylog web interface is slow after upgrade


#1

Hi,

We have a 3-node Graylog cluster that I upgraded from 2.2.3 to 2.3.1. After the upgrade the web interface was noticeably slower, especially on the Search page and the Sources page.

Last time I tried loading the Search page it took 12 seconds, but the search result says "Found 39,207 messages in 342 ms, searched in 231 indices" for the last 5 minutes. The Sources page takes roughly 6 seconds to load, and the two graphs show the spinning icon until it finishes. Both of these pages loaded almost instantaneously before the upgrade.
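To put a number on that gap, here is a minimal Python sketch that pulls the query time out of the search summary line and subtracts it from the observed page load time. The function name `unaccounted_seconds` is made up for illustration, and the 12 second figure is the manual measurement quoted above.

```python
import re


def unaccounted_seconds(summary: str, page_load_seconds: float) -> float:
    """Return the page load time not explained by the reported query time."""
    # The summary reports the Elasticsearch query time as "in <N> ms".
    match = re.search(r"in ([\d,]+) ms", summary)
    if match is None:
        raise ValueError("no query time found in summary")
    query_ms = int(match.group(1).replace(",", ""))
    return page_load_seconds - query_ms / 1000.0


summary = "Found 39,207 messages in 342 ms, searched in 231 indices"
gap = unaccounted_seconds(summary, 12.0)
print(f"{gap:.3f} s of the page load is spent outside the search query")
```

Almost all of the wait happens outside the search query itself, so the query time alone does not explain the slow page.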

I thought it might be due to the load balancer haproxy that points to nginx. I took both out of the equation and the speeds remained the same.

I then upgraded elasticsearch from 2 to 5. Still the same result.

Our setup is as follows:

3 graylog VMs running Graylog and Mongodb
20 CPUs
14GB of ram
Centos 7
Graylog 2.3.1+9f2c6ef
Linux 3.10.0-693.2.2.el7.x86_64
Oracle Corporation 1.8.0_144
openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)

3 elasticsearch VM nodes
20 CPUs
25 GB of ram
12 GB java heap
Centos 7
Linux 3.10.0-693.2.2.el7.x86_64
elasticsearch
"number" : "5.6.2",
"build_hash" : "57e20f3",
"build_date" : "2017-09-23T13:16:45.703Z",
"build_snapshot" : false,
"lucene_version" : "6.6.1"

The Indices/Index are set to
Shards: 4
Replicas: 2
Each elasticsearch node has roughly 250GB of data.
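As a rough worked example of what these settings mean for the cluster, the Python sketch below counts shard copies. It assumes that all 231 indices from the search summary above use 4 shards and 2 replicas, which may not hold if some index sets are configured differently. The helper name `shard_copies` is illustrative only.

```python
def shard_copies(indices: int, shards: int, replicas: int, nodes: int):
    """Return (total shard copies, average shard copies per data node)."""
    # Every primary shard has `replicas` additional copies.
    total = indices * shards * (1 + replicas)
    return total, total / nodes


# Assumption: every one of the 231 indices uses Shards: 4, Replicas: 2.
total, per_node = shard_copies(indices=231, shards=4, replicas=2, nodes=3)
print(f"{total} shard copies in total, about {per_node:.0f} per node")
```

Under that assumption each of the 3 Elasticsearch nodes holds a copy of every shard, so a search across all indices touches several hundred shards even though the data volume is modest.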

The hardware behind this setup is brand new and not being taxed at all. The SAN iops are hardly being touched. This setup is only receiving roughly 100 messages/sec.

Here is a link to our config for the master graylog node. The other 2 are identical except where node1 needs to be node2, etc… And only node1 is master.

https://pastebin.com/wGpbbkjU

Any ideas on what to do?

Thanks,
Ryan


#2

One thing that I’ve noticed is that when I load one of those pages and watch top on the command line I see mongod jump to 50% plus until the page loads. Not sure if that’s normal.

Checked the elasticsearch heap and it shows the following, which seems fine to me.

curl -sS -XGET "localhost:9200/_cat/nodes?h=heap*&v"
heap.current heap.percent heap.max
3.2gb 27 11.8gb
5.2gb 44 11.8gb
3.7gb 31 11.8gb
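For repeated checks, the same output can be parsed with a small script. This is a sketch only: the 75% threshold is an assumed rule of thumb rather than a value from this thread, and the function names are hypothetical.

```python
def parse_cat_nodes(text: str):
    """Parse whitespace-separated `_cat/nodes?v` output into a list of dicts."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def nodes_over(text: str, threshold_percent: int = 75):
    """Return the nodes whose heap.percent exceeds the (assumed) threshold."""
    return [
        node for node in parse_cat_nodes(text)
        if int(node["heap.percent"]) > threshold_percent
    ]


# Output copied from the curl call above.
sample = """heap.current heap.percent heap.max
3.2gb 27 11.8gb
5.2gb 44 11.8gb
3.7gb 31 11.8gb"""

print(nodes_over(sample))
```

With the values above, no node crosses the threshold, which matches the impression that heap usage is fine.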

Noticed oom-killer logs seen here.

I noticed these in my logs with 1.8.0.144 that I never noticed before. I rolled back to 141 and will check if it helped tomorrow.

Edit: It did help with the oom-messages.

Didn’t help page load speed.


#3

Completely cloned our environment to Dev and isolated it on its own closed network.

The first thing I tried was setting all the indexes to keep only one index and rotating them, which cleared out almost all data except what the closed environment is still feeding in. Didn’t see much difference in performance.

Rolled back the changes and tried downgrading mongodb. Didn’t notice any substantial difference. I didn’t notice mongod spiking on page load but that might be because the dev system is relatively idle.

Disabled tls, nginx, haproxy and loaded right from the graylog http page. No difference.

Tried reinstalling the Graylog rpm, no change.

I’m not seeing anything in the logs, the buffers are always empty, not sure what else to look for.

It seems like the graph processing on the search and sources page may be the culprit, but I could be off base.

Hoping @Jan, @jochen, or one of the other awesome people here have some insight on what to do next.

Edit: I believe it’s something to do with the calls being made on those pages, possibly something to do with the 4096 regression in the prior 3.0 release and how it was fixed. I tried rolling back to that release but it had the 4096 errors. I noticed on the current release that when I search in a stream everything loads quickly, including the graphs.

https://github.com/Graylog2/graylog2-server/issues/4054


(Jan Doberstein) #4

Hej Folks,

I checked my lab, where the following OpenJDK versions are installed, and did not see any errors. But I did notice that the interface feels slower from time to time. I will check whether this depends on which host in the cluster the LB connects me to, and whether it feels different between the 3 servers.

Thanks for bringing this up; we will investigate.

Ubuntu 16.04 LTS

java -version
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-2ubuntu1.16.04.3-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

Debian 8

openjdk version "1.8.0_111"
OpenJDK Runtime Environment (build 1.8.0_111-8u111-b14-2~bpo8+1-b14)
OpenJDK 64-Bit Server VM (build 25.111-b14, mixed mode)

CentOS 7

openjdk version "1.8.0_144"
OpenJDK Runtime Environment (build 1.8.0_144-b01)
OpenJDK 64-Bit Server VM (build 25.144-b01, mixed mode)

#5

Thanks Jan! If it is a code issue and you need someone to test, I have a whole dev environment now.

I tried rolling back the Java version for the Graylog nodes from 141 to 131 but that didn’t help.

I may stand up a Debian node and connect it to the elastic cluster and see if I get a performance difference.


#6

Haven’t had a chance to set up a Debian node to test. Still experiencing slowness.


#7

@jan Upgraded to the latest version today, still no change. I did notice that the Search page and the Sources page take 15 seconds to load, but in a stream I can run a search for the past 30 days and get the following with a 3 second page load.

Found 107,573,655 messages in 591 ms, searched in 23 indices.
Results retrieved at 2017-10-19 17:30:11.

I’m wondering if it has something to do with my streams. We have a little under 50 streams each with its own index.

Edit: Going to try changing my indices that had been set to 30 days to 1 day and see if that makes any difference.
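To see which index sets contribute most to the total, the index names can be grouped by prefix. The sketch below assumes names of the form `<prefix>_<number>` and takes the list of names as plain input (for example, copied from an index listing). The sample names are illustrative, not the real list from this cluster.

```python
from collections import Counter


def indices_per_prefix(index_names):
    """Count indices per index-set prefix, assuming '<prefix>_<number>' names."""
    counts = Counter()
    for name in index_names:
        prefix, sep, suffix = name.rpartition("_")
        if sep and suffix.isdigit():
            counts[prefix] += 1
        else:
            # Name does not follow the assumed pattern; count it on its own.
            counts[name] += 1
    return counts


# Illustrative sample only.
names = ["2n-devices_1", "2n-devices_8", "graylog_0", "graylog_1", "graylog_2"]
for prefix, count in indices_per_prefix(names).most_common():
    print(prefix, count)
```

With a little under 50 index sets, even a handful of retained indices per set adds up to a few hundred indices overall.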


#8

Changing the indices may have helped take off a second at most.

Installed the Graylog MongoDB plugin and checked the query times. Most returned 0ms. Slowest returned .036ms.


#9

I am seeing

2017-10-20T12:55:16.462-04:00 ERROR [UsageStatsClusterPeriodical] Uncaught exception in periodical
org.graylog2.indexer.ElasticsearchException: Fetching message count failed for indices [2n-devices_1, 2n-devices_8, 2n-devices_1

An HTTP line is larger than 4096 bytes.
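The 4096 in that message matches the default of Elasticsearch's `http.max_initial_line_length` setting (4kb). As a sketch of why the number of indices matters, the Python snippet below estimates the length of an HTTP request line when all index names are listed in the URL path. That Graylog builds its request exactly this way is an assumption here, and the index names and the `/_search` suffix are hypothetical.

```python
def request_line_length(index_names, path_suffix="/_search",
                        method="GET", version="HTTP/1.1") -> int:
    """Return the byte length of '<method> /<idx1>,<idx2>,...<suffix> <version>'."""
    target = "/" + ",".join(index_names) + path_suffix
    return len(f"{method} {target} {version}".encode("utf-8"))


LIMIT = 4096  # default initial line limit, in bytes

# Hypothetical names: 231 indices with a fairly long prefix.
names = [f"example-index-set-name_{i}" for i in range(231)]
length = request_line_length(names)
print(length, "bytes; over the limit:", length > LIMIT)
```

The request line grows with every index name, so a cluster with a few hundred indices can cross the limit while a search restricted to one stream's indices stays well under it.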

I tried setting http.max_initial_line_length: 64k in /etc/sysconfig/elasticsearch but that doesn’t appear to have worked, as it might not be the right place to set it. I tried setting it in /etc/elasticsearch/elasticsearch.yml but then elasticsearch doesn’t want to start.

Does anyone know how to set this on a centos/rhel system? Running elasticsearch 5.6.


(Jochen) #10

This will be fixed in Graylog 2.4.0:


#11

Thanks Jochen. Do you know if this might explain the weirdness I’m seeing with slowness mentioned above?

It’s weird because in a stream searching works as expected quickly. But from the search tab it’s 15 seconds to load. I’ve been wondering if this is related to the thread you mentioned that I’ve been following.


(Jochen) #12

Maybe you could check the Network tab of the Developer Console of your web browser to find out which request takes longest.
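If eyeballing the Network tab is inconclusive, the requests can be exported as a HAR file and ranked by duration. A minimal sketch, assuming the standard HAR layout (`log.entries[].time` in milliseconds and `log.entries[].request.url`); the URLs in the sample are placeholders, not real Graylog endpoints.

```python
def slowest_requests(har: dict, top: int = 5):
    """Return (url, time_ms) for the slowest entries in a parsed HAR document."""
    entries = har["log"]["entries"]
    ranked = sorted(entries, key=lambda entry: entry["time"], reverse=True)
    return [(entry["request"]["url"], entry["time"]) for entry in ranked[:top]]


# Placeholder data; a real export would be loaded with json.load(open(path)).
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "http://graylog.example/a"}, "time": 4000.0},
            {"request": {"url": "http://graylog.example/b"}, "time": 12000.0},
            {"request": {"url": "http://graylog.example/c"}, "time": 150.0},
        ]
    }
}

for url, time_ms in slowest_requests(sample_har, top=2):
    print(f"{time_ms / 1000:.1f} s  {url}")
```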



#13

I had used the dev tools from chrome and ff a while back but didn’t see anything standing out. The one thing that appeared to be the slowest was the bar chart. If I loaded the page I would see the search results show up in 4 seconds and the bar chart might take 12 seconds to finish.

This seems to be resolved now. I ended up basing our index rotation on size and greatly reduced the total number of indices we had. This was something I had been meaning to do regardless of the issue we had, to avoid something like a DoS filling up our disks.

Reducing the total number of indices seems to have made the biggest impact and things seem to be working at regular speeds now. It is strange that we didn’t have this issue before the upgrade which makes me believe it has to do with how the calls are being made over http to elasticsearch.

