Very slow dashboards

I apologize, as I know slow dashboards are not a new topic; however, I’ve yet to find a solution that helps.

Infrastructure AWS hosted:
2 x Graylog nodes (m5.2xlarge - 8 cpu and 32GB) (also running mongo - default config)
5 x Elasticsearch 6.8 nodes (i3.4xlarge - 16 cpu and 128GB, 3.5 TB software Raid 0 of physical SSDs)
1 x HAProxy load balancer in front of Graylog

Graylog version 3.3 (16GB allocated for heap) - configured to talk to all 5 nodes
Elasticsearch 6.8.7 (30GB allocated for heap) - configured to use local SSDs in Raid 0 for ES storage.

Pretty much default configs for each (can provide parts if it would help the investigation).

A separate index set was created for this particular stream. It’s currently holding 4TB in 144 indices, with 5 shards per index and 2 replicas. I had adjusted to 20 shards per index for several days, but it had no effect; if anything, the dashboard felt slower to load. The rotation strategy is set to 1 index per hour. Most of our queries do not look past 8 hours, but even a 30 minute window is very slow to load (>60 seconds). Field type refresh interval: 30 seconds
Max segments: 1
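
(In case it helps the investigation, per-index shard counts and sizes for this index set can be pulled with something like the command below; graylog_* is just a placeholder for the actual index prefix:)

elastic-node # curl -s "localhost:9200/_cat/indices/graylog_*?v&h=index,pri,rep,docs.count,store.size&s=index"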

The Graylog server load average (during normal traffic) is 1.40, 1.40, 1.40.
ES load average (during normal traffic) is 2.5, 2.5, 2.5.
These numbers are obviously estimates, but they show there is no CPU waiting going on.
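
(If it helps anyone reproduce that check, running something like the following during a dashboard refresh should confirm there is no I/O wait; iostat assumes the sysstat package is installed:)

elastic-node # iostat -x 5
elastic-node # vmstat 5

The %iowait and %util columns in iostat, and the wa column in vmstat, are the ones to watch.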

The dashboard does have multiple graphs on it. There are 4 tabs, each with roughly the same set of widgets:
1-4 x 1 minute interval line graphs, grouped by a data type (default 8 hour window)
4-6 x pie charts, each by a different data type (30 minute window)
1-2 x table chart of recent messages sorted by timestamp

This particular dashboard is extremely slow to load, usually taking about a minute, if not longer, for each refresh. I see no spikes in CPU usage on any of the ES nodes or the Graylog nodes when loading the page. I have other dashboards for different indices, with significantly less data, that load just fine. Reducing to a 30 minute window is still extremely slow to load.

Any help is appreciated, please let me know if any specific parts of configuration files would be helpful.

Hello, Jim, welcome back to the community! By moving your latest post to our Daily Challenges, we can expect to get more responses to your question.

Have you checked out these posts on the subject:

@jfurnier

Just to add to what @dscryber stated.
Have you tried iotop? It monitors Linux disk I/O activity and usage on a per-process basis. I had to use it a while back to resolve some issues, which ended up being Java.
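Something along these lines should do it (show only processes actually doing I/O, per process rather than per thread, with accumulated totals and a 5 second refresh):

graylog-node # iotop -oPa -d 5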

@dscryber, sorry, neither of those seemed to help out. My systems are all running on standalone machines with pretty much maxed resources allocated (30GB-32GB of RAM per service).

@gsmith, I’ve used that quite a bit and none of the ES or GL systems show more than 1.0% of IO activity. We had an issue in the past accepting so much data (3,000+ messages per second, at about 300GB per day), but we switched from Elastic Block Storage to ephemeral (direct-attached SSD) storage and that fixed it. So we have no issues writing the data to ES. The issue seems to be reading from the various streams.

Multiple index sets have been created, with various retention values. Most are 4 shards with 1 replica and a rotation period of 1 day. Others are configured differently because there are more documents in the index; for example, one may rotate at a max index size of 5GB, so several indices are created per day. Streams are then configured to use an index set.

It doesn’t matter which stream (or index) is searched; it’s always very slow. Dashboards are our main use for Graylog; however, just searching streams is very, very slow (30+ seconds for an index that’s less than 100MB in size, rotated once per day, when looking at a 2 hour period - so there “shouldn’t” be much data to filter through).

It seems that with age (meaning the more indices that have been created over the years), the system is just slowing down - even with more ES data nodes in the cluster.

Thanks a lot for the help.

We had kind of the same problem with dashboards. I had a lot of widgets per tab and 4-5 tabs per dashboard. What we did was limit the number of tabs and increase the update interval from 10 seconds to 2-3 minutes on some widgets. Some we had to leave alone because we needed the update interval to stay at 10 seconds.
This did reduce the time needed to refresh the dashboard, so it didn’t take forever.
Sorry I can’t be more help.

Thanks for the ideas @gsmith, but it’s definitely more than just dashboards (I’ll update the title if I can). Any search I do is slow, even ones against a small index (<100MB) where only a single index should be searched for the given time period. It’s just much worse for larger indices or when a search spans multiple indices.


Pinging this to see if anyone else has any ideas.

ES CPU and IO load all seem fine; that cluster shows no signs of struggling while ingesting data or performing searches. The GL servers themselves also show no indication of high CPU usage or IO waits during searching. I rarely even see MongoDB pop up in the top list, and even then, the load is very low.

I can navigate through GL system pages without issue. The top-level Dashboard page loads fast, and all System pages are very fast to respond. It only seems to slow down when the system has to search through indices, even in cases when only a single (smaller) index should be searched over a short time period of 5 minutes. The more indices that need to be searched, whether due to the time period or the data stored, the longer the load times, though the slowdown does not seem proportional to the size or number of indices searched.

So it seems searching indices, single or multiple, small or large, is always slow when Graylog searches them.

I have another application that accesses indices in the same ES cluster; it has no issues at all performing searches and is extremely fast.

So I’m leaning towards the issue here being GL or MongoDB.
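
One thing I could try, to narrow it down further, is timing a raw range query directly against Elasticsearch and comparing its took value (milliseconds spent inside ES) with what Graylog reports for the same window. A rough sketch; graylog_* is a placeholder for the index set’s actual prefix:

elastic-node # curl -s -H 'Content-Type: application/json' "localhost:9200/graylog_*/_search?size=0" -d '
{
  "query": { "range": { "timestamp": { "gte": "now-30m" } } }
}'

If took comes back in milliseconds while the same window takes 30+ seconds through Graylog, that would point away from Elasticsearch.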

Thanks for any help.

Hello,

Is it possible for you to show your Graylog config file?

Here you go @gsmith:

Note, I removed any commented lines, including unset values, so if it’s not included here, it’s set to the default value.

/etc/default/graylog-server

JAVA=/usr/bin/java
GRAYLOG_SERVER_JAVA_OPTS="-Djava.net.preferIPv4Stack=true -Xms16g -Xmx16g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"
GRAYLOG_SERVER_ARGS=""
GRAYLOG_COMMAND_WRAPPER=""

/etc/graylog/server/server.conf

# Only set on one of the servers
is_master = True

node_id_file = /etc/graylog/server/node-id

password_secret = 12345
root_username = admin
root_password_sha2 = 123abccba321
root_timezone = UTC

plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 172.30.1.1:9000

elasticsearch_hosts = http://es1:9200,http://es2:9200,...http://es5:9200
elasticsearch_max_total_connections = 5000
elasticsearch_analyzer = standard
output_batch_size = 5000

output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 8
outputbuffer_processors = 8

processor_wait_strategy = blocking
ring_size = 65536

inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 200gb

lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog1:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5

transport_email_enabled = true
transport_email_hostname = email-smtp.region.amazonaws.com
transport_email_port = 587
transport_email_use_auth = true
transport_email_use_tls = true
transport_email_use_ssl = false
transport_email_auth_username = abc
transport_email_auth_password = 321
transport_email_subject_prefix = [graylog]
transport_email_from_email = graylog@emails.com

content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json

proxied_requests_thread_pool_size = 32

Checking back in to see if anyone else has any suggestions or if I can provide more details to help solve this issue. The slow performance is a huge pain point for us.

Thanks a lot.

Sorry to keep bumping this, but we really need some kind of advice for a fix. Thanks.

Hi, Jim,

I've pinned your question thread to the Daily Challenges thread where it should get more visibility from members.  Please be sure that you're including as much information as possible.  That may help you get a solution quicker.

What user type are you: Graylog Open, Graylog Small Business <5 GB, Enterprise, or Cloud?

Hey @jfurnier sorry your issue has been so long outstanding, I know slow stuff is frustrating…

First I want to assure you it is 99% likely not a MongoDB issue. In the Graylog environment, MongoDB simply stores and distributes configuration data, no log data whatsoever. Slow dashboard performance is most likely either a processing issue or a search performance issue. Based on your system stats above and the fact that you aren’t seeing any excessive CPU or disk activity, I’m leaning toward the latter being the root cause.

Let’s get started…

Check Log Files

First and foremost, have you checked your Graylog server log or Elasticsearch log for any indicators of issues? Open two shells, one on the graylog node on which you’re loading the dashboard and one on any elasticsearch node with a primary shard (since you have 5 primary shards and 5 ES nodes, should be able to use any node), and execute the following commands to set up a tail of the files:

graylog-node # tail -F /var/log/graylog/server.log
elastic-node # tail -F /var/log/elasticsearch/graylog.log

Then refresh/relaunch the dashboard and see what pops up in the log files.
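
If the logs are chatty, a rough filter along these lines can help surface the interesting lines while you refresh (adjust the pattern as needed):

graylog-node # tail -F /var/log/graylog/server.log | grep -iE 'warn|error|timeout'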

Index Set Configuration

Your ratio of index size to quantity may be skewed too much toward quantity. You stated that the Index Set in question is configured to create 5 primary shards and 2 replicas per index, with 1 index per hour. That means you’re creating 360 shards per day. And at 300GB/day as stated above, that comes out to roughly 2.5 GB per primary shard, which is really small.
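
For clarity, roughly how those numbers break down:

5 primaries x 3 copies (1 primary + 2 replicas) = 15 shards per index
15 shards/index x 24 indices/day = 360 shards per day
300 GB/day / (5 primaries x 24 indices/day) = 2.5 GB per primary shard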

Can you send me the Retention Strategy and Max number of indices values? (Go to System > Indices, click Edit next to the Index Set, and scroll down to Index Retention Configuration.)

Also, can you please send me the output of this command, which shows Elasticsearch cluster health:

elastic-node # curl -X GET "localhost:9200/_cluster/health"


We are an Open user, producing about 300GB of data per day. All of our servers are self-hosted at AWS and self-managed. All services, except for Mongo, are on standalone machines as spec’d in the initial post. Mongo is running on the master GL node.

As I understand it, most of the issues people had that were similar to ours were caused by too many services running on a single underpowered machine. That doesn’t seem to be our issue, though.

One note on the initial post: the HAProxy load balancer is not used for ingesting messages; those are sent directly to the GL nodes. It’s used for load balancing web access to the GL nodes.

The issue is searching through streams and loading dashboards, but there’s no apparent high resource usage on the ES cluster or GL nodes when performing these searches.

I’m happy to provide any more details if necessary. Thanks a lot.


Hello,
Checking on your status.
After looking through your configuration, I must agree with what @fountainheadsecurity suggested. What was the outcome of what he was asking?

@jfurnier I’ll echo @fountainheadsecurity and @gsmith. I’d be keen to know more about Elasticsearch in your deployment, specifically the number and size of shards in the environment. I’m finding that, more and more, folks are simply taking the defaults when it comes to index sets and not considering whether they make sense. This leads to either oversharding or overly large indices, both of which can have serious consequences for Graylog’s performance.

That said, if you can provide the output of the health command, as well as the output of:
curl -X GET http://localhost:9200/_cat/shards

And

curl -u admin -H 'X-Requested-By: Me' 'http://localhost:9000/api/system/indices/index_sets'

That’d give a lot more context as to 1) what’s going on with Elasticsearch and 2) how you have Graylog configured.
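
If the shard listing is long, sorting by store size (the cat APIs accept an s= sort parameter in 6.x) makes undersized shards easy to spot, for example:

curl -X GET 'http://localhost:9200/_cat/shards?v&s=store:desc'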