Very slow dashboards

I apologize, as I know slow dashboards are not a new topic; however, I've yet to find a solution that helps.

Infrastructure (AWS hosted):
2 x Graylog nodes (m5.2xlarge - 8 vCPU and 32GB RAM) (also running MongoDB - default config)
5 x Elasticsearch 6.8 nodes (i3.4xlarge - 16 vCPU and 128GB RAM, 3.5TB software RAID 0 of physical SSDs)
1 x HAProxy load balancer in front of Graylog

Graylog version 3.3 (16GB allocated for heap) - configured to talk to all 5 nodes
Elasticsearch 6.8.7 (30GB allocated for heap) - configured to use local SSDs in RAID 0 for ES storage.

Pretty much default configs for each (can provide parts if it would help the investigation).

A separate index set was created for this particular stream. It's currently holding 4TB across 144 indices, with 5 shards per index and 2 replicas. I adjusted to 20 shards per index for several days, but it had no effect; if anything, the dashboard felt slower to load. The rotation strategy is set to 1 index per hour. Most of our queries do not look past 8 hours, but even a 30 minute window is very slow to load (>60 seconds). Field type refresh interval: 30 seconds
Max segments: 1
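For reference, per-index sizes and document counts for this index set can be pulled with something along these lines (<index_prefix> is a placeholder for the set's actual prefix):

elastic-node # curl -s 'localhost:9200/_cat/indices/<index_prefix>_*?v&h=index,pri,rep,docs.count,store.size&s=index'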

The Graylog server load average (normal traffic time) is 1.40, 1.40, 1.40.
The ES load average (normal traffic time) is 2.5, 2.5, 2.5.
These numbers are obviously estimates, but they show there is no CPU waiting going on.

The dashboard does have multiple graphs on it. There are 4 tabs, each with roughly the same charts as listed below:
1-4 x 1 minute interval line graphs, grouped by a data type (default 8 hour window)
4-6 x pie charts, each by a different data type (30 minute window)
1-2 x table chart of recent messages sorted by timestamp

This particular dashboard is extremely slow to load, usually taking about a minute, if not longer, for each refresh. I see no spikes in CPU usage on any of the ES nodes or the Graylog nodes when loading the page. I have other dashboards for different indices with significantly less data, and those load just fine. Reducing to a 30 minute window is still extremely slow to load.

Any help is appreciated, please let me know if any specific parts of configuration files would be helpful.

Hello, Jim, welcome back to the community! By moving your latest post to our Daily Challenges, we can expect to get responses to your question.

Have you checked out these posts on the subject?

@jfurnier

Just to add to what @dscryber stated:
Have you tried iotop? It monitors Linux disk I/O activity and usage on a per-process basis. I had to use it a while back to resolve some issues, which ended up being Java.
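A minimal example of how it can be run (flags may vary by version, so check your man page):

graylog-node # iotop -oPa    # -o: only processes doing I/O, -P: per-process view, -a: accumulated totals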

@dscryber, sorry, neither of those seemed to help out. My systems are all running on standalone machines with pretty much maxed resources allocated (30GB - 32GB of RAM per service).

@gsmith, I've used that quite a bit, and none of the ES or GL systems show more than 1.0% of I/O activity. We had an issue in the past accepting so much data (3,000+ messages per second, at about 300GB per day), but we switched from Elastic Block Storage to ephemeral (direct attached SSD) storage and that fixed the issue. So we have no issues writing data to ES. The issue seems to be reading from the various streams.

Multiple index sets have been created, with various retention values. Most are 4 shards, 1 replica, with a rotation period of 1 day. Others are configured with different retention settings because they hold more documents; for example, one may have a max index size of 5GB, with several indices created per day. Streams are then configured to write to an index set.

Though it doesn't matter which stream (or index) is searched, it's always very slow. Dashboards are our main use for Graylog; however, just searching streams is also very, very slow (30+ seconds for an index that's less than 100MB in size, rotated once per day, when looking at a 2 hour period - so there "shouldn't" be much data to filter through).

It seems that with age (meaning the more indices that have been created over the years), the system is just slowing down - even with more ES data nodes in the cluster.
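In case it's useful, here's roughly how the total shard count across the cluster (primaries plus replicas) can be tallied:

elastic-node # curl -s 'localhost:9200/_cat/shards?h=index' | wc -l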

Thanks a lot for the help.

We had kind of the same problem with dashboards. I had a lot of widgets per tab and around 4-5 tabs per dashboard. What we did was limit the number of tabs and increase the update interval from 10 seconds to 2-3 minutes on some widgets. Some we had to leave alone because we needed the update interval to stay at 10 seconds.
This did reduce the time needed to refresh the dashboard, so it didn't take forever.
Sorry I can't be of more help.

Thanks for the ideas @gsmith, but it's definitely more than just dashboards (I'll update the title if I can). Any search I do is slow, even ones against a small index (<100MB) that would be the only one searched for the given time period. It's just much worse for larger indices or when the search spans multiple indices.

Pinging this to see if anyone else has any ideas.

ES CPU and I/O load all seem fine; that cluster shows no signs of struggling while ingesting data or performing searches. The GL servers themselves also show no indication of high CPU usage or I/O waits during searching. I rarely even see MongoDB pop up in the top list, and even then the load is very low.

I can navigate through GL system pages without issue. The top-level Dashboard page loads fast, and all System pages are very fast to respond. Things only seem to slow down when the system has to search through indices, even in cases where only a single (smaller) index should be searched over a short time period of 5 minutes. The more indices that need to be searched, whether due to the time period or the data stored, the longer the load times, though the increase does not seem proportional to the size or number of indices searched.

So it seems searching indices, single or multiple, small or large, is always slow when Graylog searches them.

I have another application that accesses indices in the same ES cluster; it has no issues at all performing searches and is extremely fast.

So I’m leaning towards the issue here being GL or MongoDB.
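One thing worth trying, to rule ES in or out, is enabling the search slow log on a couple of the affected indices (the thresholds below are just a guess, and <index_prefix> is a placeholder for the actual prefix):

elastic-node # curl -X PUT 'localhost:9200/<index_prefix>_*/_settings' -H 'Content-Type: application/json' -d '{"index.search.slowlog.threshold.query.warn": "5s", "index.search.slowlog.threshold.fetch.warn": "1s"}'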

Thanks for any help.

Hello,

Is it possible for you to show your Graylog config file?

Here you go @gsmith:

Note: I removed any commented-out lines, including unset values, so if a setting isn't included here, it's set to the default value.

/etc/default/graylog-server

JAVA=/usr/bin/java
GRAYLOG_SERVER_JAVA_OPTS="-Djava.net.preferIPv4Stack=true -Xms16g -Xmx16g -XX:NewRatio=1 -server -XX:+ResizeTLAB -XX:+UseConcMarkSweepGC -XX:+CMSConcurrentMTEnabled -XX:+CMSClassUnloadingEnabled -XX:+UseParNewGC -XX:-OmitStackTraceInFastThrow"
GRAYLOG_SERVER_ARGS=""
GRAYLOG_COMMAND_WRAPPER=""

/etc/graylog/server/server.conf

# Only set on one of the servers
is_master = True

node_id_file = /etc/graylog/server/node-id

password_secret = 12345
root_username = admin
root_password_sha2 = 123abccba321
root_timezone = UTC

plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 172.30.1.1:9000

elasticsearch_hosts = http://es1:9200,http://es2:9200,...http://es5:9200
elasticsearch_max_total_connections = 5000
elasticsearch_analyzer = standard
output_batch_size = 5000

output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 8
outputbuffer_processors = 8

processor_wait_strategy = blocking
ring_size = 65536

inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking

message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_size = 200gb

lb_recognition_period_seconds = 3
mongodb_uri = mongodb://graylog1:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5

transport_email_enabled = true
transport_email_hostname = email-smtp.region.amazonaws.com
transport_email_port = 587
transport_email_use_auth = true
transport_email_use_tls = true
transport_email_use_ssl = false
transport_email_auth_username = abc
transport_email_auth_password = 321
transport_email_subject_prefix = [graylog]
transport_email_from_email = graylog@emails.com

content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json

proxied_requests_thread_pool_size = 32

Checking back in to see if anyone else has any suggestions or if I can provide more details to help solve this issue. The slow performance is a huge pain point for us.

Thanks a lot.

Sorry to keep bumping this, but we really need some kind of advice for a fix. Thanks.

Hi, Jim,

I've pinned your question to the Daily Challenges thread, where it should get more visibility from members. Please be sure that you're including as much information as possible; that may help you get a solution more quickly.

What user type are you: Graylog Open, Graylog Small Business <5 GB, Enterprise, or Cloud?

Hey @jfurnier, sorry your issue has been outstanding for so long; I know slow stuff is frustrating…

First I want to assure you it is 99% likely not a MongoDB issue. In the Graylog environment, MongoDB simply stores and distributes configuration data, no log data whatsoever. Slow dashboard performance is most likely either a processing issue or a search performance issue. Based on your system stats above and the fact that you aren’t seeing any excessive CPU or disk activity, I’m leaning toward the latter being the root cause.

Let’s get started…

Check Log Files

First and foremost, have you checked your Graylog server log or Elasticsearch log for any indicators of issues? Open two shells, one on the Graylog node on which you're loading the dashboard and one on any Elasticsearch node with a primary shard (since you have 5 primary shards and 5 ES nodes, you should be able to use any node), and execute the following commands to set up a tail of the files:

graylog-node # tail -F /var/log/graylog/server.log
elastic-node # tail -F /var/log/elasticsearch/graylog.log

Then refresh/relaunch the dashboard and see what pops up in the log files.

Index Set Configuration

Your ratio of index size to quantity may be skewed too much toward quantity. You stated for the Index Set in question that it's configured to create 5 primary shards and 2 replicas per index, with 1 index per hour. That means you're creating 360 shards per day. And at 300GB/day as stated above, that comes out to about 2.5 GB/shard, which is really small.
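To spell out that math:

5 primary shards x 24 indices/day = 120 primary shards/day (plus 240 replica shards = 360 shards/day total)
300 GB/day / 120 primary shards = ~2.5 GB per primary shard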

Can you send me the Retention Strategy and Max number of indices values (go to System > Indices, click Edit next to the Index Set, and scroll down to Index Retention Configuration)?

Also, can you please send me the output of this command, which shows Elasticsearch cluster health:

elastic-node # curl -X GET "localhost:9200/_cluster/health"

We are an Open user, producing about 300GB of data per day. All of our servers are hosted at AWS and self-managed. All services, except for Mongo, are on standalone machines as spec'd in the initial post. Mongo is running on the master GL node.

As I understand it, most of the issues people have had that were similar to ours came down to too many services running on a single underpowered machine. That doesn't seem to be our issue, though.

One note on the initial post: the HAProxy load balancer is not used for ingesting messages; those are sent directly to the GL node. It's used for load balancing web access to the GL nodes.

The issue is searching through streams and loading dashboards, but there’s no apparent high resource usage on the ES cluster or GL nodes when performing these searches.

I’m happy to provide any more details if necessary. Thanks a lot.

Hello,
Checking on your status.
After looking through your configuration, I must agree with what @fountainheadsecurity suggested. What was the outcome of what he was asking?

@jfurnier I'll echo @fountainheadsecurity and @gsmith. I'd be keen to know more about Elasticsearch in your deployment, specifically the number and size of shards in the environment. I'm finding that, more and more, folks are simply taking the defaults when it comes to index sets without considering whether they make sense. This leads to either oversharding or large indices, both of which can have serious consequences for Graylog's performance.

That said, if you can provide the output of the health command, as well as the output of:
curl -X GET http://localhost:9200/_cat/shards

And

curl -u admin -H 'X-Requested-By: Me' 'http://localhost:9000/api/system/indices/index_sets'

That'd give a lot more context as to 1) what's going on with Elasticsearch and 2) how you have Graylog configured.

I’m very sorry for the delay, but am very grateful for the responses.

I'm not seeing any specific entries in either log when loading a dashboard or performing searches, but I have posted some of the entries at the bottom.

ES Cluster Health

{
	"cluster_name": "ES_Name",
	"status": "green",
	"timed_out": false,
	"number_of_nodes": 11,
	"number_of_data_nodes": 8,
	"active_primary_shards": 7210,
	"active_shards": 19328,
	"relocating_shards": 0,
	"initializing_shards": 0,
	"unassigned_shards": 0,
	"delayed_unassigned_shards": 0,
	"number_of_pending_tasks": 0,
	"number_of_in_flight_fetch": 0,
	"task_max_waiting_in_queue_millis": 0,
	"active_shards_percent_as_number": 100.0
}

Index set details for worker_graylog - this is one of the index sets in use that's performing badly

Index prefix: worker_graylog
Shards: 5
Replicas: 2
Field type refresh interval: 30 seconds
Index rotation strategy: Index Size
Max index size: 5073741824 bytes (4.7GiB)
Index retention strategy: Delete
Max number of indices: 400
Current Total Size: 1.7TB
Last closed Index: 4.3GB / 1,713,687 messages

I can see from the above that there's one index for roughly every 20 minutes of data, but I have the dashboard configured to view 8 hours of data by default, so I can see how this is bad. What would be a good size for this index, given that volume of messages and a desired window of about 8 hours? Any recommendations on the number of shards?

Thank you very much for the support.

Shard details - very cut down

worker_graylog_20776             3 p STARTED   266093 821.2mb 172.30.0.11   node-11
worker_graylog_20776             3 r STARTED   266093 821.2mb 172.30.0.3  node-3
worker_graylog_20776             3 r STARTED   266093 821.2mb 172.30.0.2 node-2
worker_graylog_20776             1 p STARTED   266662 821.1mb 172.30.0.4  node-4
worker_graylog_20776             1 r STARTED   266662 821.1mb 172.30.0.6 node-6
worker_graylog_20776             1 r STARTED   266662 821.1mb 172.30.0.11   node-11
worker_graylog_20776             2 p STARTED   265929   823mb 172.30.0.7 node-7
worker_graylog_20776             2 r STARTED   265929 823.1mb 172.30.0.11   node-11
worker_graylog_20776             2 r STARTED   265929 823.1mb 172.30.0.11   node-11
worker_graylog_20776             4 p STARTED   266175 818.1mb 172.30.0.3  node-3
worker_graylog_20776             4 r STARTED   266175 818.1mb 172.30.0.1 node-1
worker_graylog_20776             4 r STARTED   266175 818.1mb 172.30.0.6 node-6
worker_graylog_20776             0 r STARTED   265908 825.5mb 172.30.0.4  node-4
worker_graylog_20776             0 r STARTED   265908 825.5mb 172.30.0.1 node-1
worker_graylog_20776             0 p STARTED   265908 825.5mb 172.30.0.2 node-2
worker_graylog_20673             3 p STARTED   194811 906.2mb 172.30.0.11   node-11
worker_graylog_20673             3 r STARTED   194811 906.3mb 172.30.0.3  node-3
worker_graylog_20673             3 r STARTED   194811 906.3mb 172.30.0.6 node-6
worker_graylog_20673             1 p STARTED   195307 907.1mb 172.30.0.7  node-4
worker_graylog_20673             1 r STARTED   195307 907.1mb 172.30.0.7 node-7
worker_graylog_20673             1 r STARTED   195307 907.1mb 172.30.0.2 node-2
worker_graylog_20673             2 p STARTED   194406 907.7mb 172.30.0.7 node-7
worker_graylog_20673             2 r STARTED   194406 907.6mb 172.30.0.11   node-11
worker_graylog_20673             2 r STARTED   194406 907.7mb 172.30.0.11   node-11
worker_graylog_20673             4 p STARTED   194842 910.3mb 172.30.0.3  node-3
worker_graylog_20673             4 r STARTED   194842 910.4mb 172.30.0.1 node-1
worker_graylog_20673             4 r STARTED   194842 910.4mb 172.30.0.2 node-2
worker_graylog_20673             0 r STARTED   194448 904.7mb 172.30.0.4  node-4
worker_graylog_20673             0 r STARTED   194448 904.7mb 172.30.0.1 node-1
worker_graylog_20673             0 p STARTED   194448 904.7mb 172.30.0.6 node-6

A DEBUG message found in the ES logs - I'm not finding anything above the debug level, but I do see a few of these. This is just an unexpected value; correct me if I'm wrong, but the document just gets dropped, and it's more of an application issue to get fixed.

[DEBUG][o.e.a.b.TransportShardBulkAction] [node-6] [worker_graylog_20818][0] failed to execute bulk item (index) index {[worker_graylog_deflector][message][3bf4da85-f5aa-11eb-b085-0667c7b27054], source[n/a, actual length: [5.9kb], max length: 2kb]}
java.lang.IllegalArgumentException: mapper [ctxt_to] of different type, current_type [long], merged_type [keyword]
	at org.elasticsearch.index.mapper.FieldMapper.doMerge(FieldMapper.java:354) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.NumberFieldMapper.doMerge(NumberFieldMapper.java:1093) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.FieldMapper.merge(FieldMapper.java:340) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.FieldMapper.merge(FieldMapper.java:52) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.ObjectMapper.doMerge(ObjectMapper.java:487) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.RootObjectMapper.doMerge(RootObjectMapper.java:278) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.ObjectMapper.merge(ObjectMapper.java:457) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.RootObjectMapper.merge(RootObjectMapper.java:273) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.Mapping.merge(Mapping.java:91) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.index.mapper.DocumentMapper.merge(DocumentMapper.java:339) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.applyRequest(MetaDataMappingService.java:273) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.cluster.metadata.MetaDataMappingService$PutMappingExecutor.execute(MetaDataMappingService.java:231) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:643) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:270) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:200) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:135) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:681) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) ~[elasticsearch-6.8.7.jar:6.8.7]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) ~[elasticsearch-6.8.7.jar:6.8.7]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]

Corresponding WARN messages in the Graylog server.log related to the above field issue, but no other warnings or errors when loading dashboards or searching:

2021-08-05T00:00:29.430-07:00 WARN  [Messages] Failed to index message: index=<worker_graylog_20825> id=<cebdf5d1-f5ba-11eb-b085-0667c7b27054> error=<{"type":"es_rejected_execution_exception","reason":"rejected execution of processing of [4694646448][indices:data/write/bulk[s][p]]: request: BulkShardRequest [[worker_graylog_20825][1]] containing [308] requests, target allocation id: JD3nDaqqQfe4LRh_Fx5NAg, primary term: 1 on EsThreadPoolExecutor[name = node-4/write, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@5154cb59[Running, pool size = 16, active threads = 16, queued tasks = 203, completed tasks = 2018768090]]"}>
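If it helps, the write/search thread pool queues and rejections can be watched with something along these lines while reproducing the slowness:

elastic-node # curl -s 'localhost:9200/_cat/thread_pool/write,search?v&h=node_name,name,active,queue,rejected'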

Whew! On its face, the information you've presented leads me to think that this is a case of oversharding. I mean, 20K shards is A LOT of shards, and there's overhead with each one of them. Our recommendations mirror those of Elasticsearch: that your shards be between 10-50GB each, and that you stay within a ratio of about 20 shards per GB of cluster heap.
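A quick way to eyeball how many shards each node is carrying, to compare against that guideline (column names may vary slightly by version):

elastic-node # curl -s 'localhost:9200/_cat/allocation?v&h=node,shards,disk.indices,disk.percent'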

I'd hazard a guess that the oversharding is likely leading to all of your heap being consumed, and that if I were to look at your indices stats, shard operations would be taking days/weeks to complete versus being done in minutes. So basically, Elasticsearch is starved for resources.

Most of what I’m going to recommend comes down to tuning your index sets. In general, it’s better to have fewer, larger indices (keep in mind that their shards should be below 50GB) than it is to have many smaller indices. So in the example index set you provided, you have your shards configured as:

Shards: 5
Replicas: 2

That results in 2 replica shards per primary shard, giving you a total of 15 shards per index. You're keeping 400 indices, so 15*400 = 6,000 shards. That means you'd need 300GB of heap in your cluster to reliably support that number of shards. Based on the specs you provided for the cluster, you only have 150GB of heap, so you're definitely oversubscribed in terms of shards. Now, there's a bit of play that you have in terms of your heap. Elasticsearch has long recommended that folks not go above 32GB when setting heap, due to compressed object pointers in Java. HOWEVER, there is a threshold (>48GB) where the performance hit you take by going over 32GB can be overcome. But keep in mind, that's not generally recommended.
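Putting numbers to that:

15 shards/index x 400 indices = 6,000 shards for this index set alone
6,000 shards / 20 shards per GB of heap = 300 GB of heap needed
5 ES nodes x 30 GB heap = 150 GB of heap available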

With all of that said, I think the messages you’re seeing in the logs are indicative of the lack of resources in Elasticsearch, so once you tune your index sets and get your shards under control, the dashboards should be more responsive.
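As a rough sketch only (the numbers are illustrative, not a prescription), a size-rotated index set at your volume could look something like the following, which keeps each shard around 20GB:

Index rotation strategy: Index Size
Max index size: ~40 GB
Shards: 2
Replicas: 1
Max number of indices: sized to fit your retention window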

@aaronsachs Thank you for the speedy reply. That all makes sense. I'll adjust my indices so that my shards are between 10-50GB in size, thereby reducing the total number of shards. I believe I was confused about index size versus shard size.

Based on my configuration of 8 data nodes, each with 32GB of heap, I should be able to handle approximately 5,120 total shards efficiently?

32GB heap x 20 shards per GB = 640 shards per node
640 shards x 8 data nodes = 5,120 ideal max shards

Does that sound about right?

Thanks a lot for the help.