We want to migrate from Kibana to Graylog, but all users are complaining about sluggish performance.
We run Graylog 5.2 using Docker Compose, on two hosts. One host runs a three-node OpenSearch 2.11 cluster (all three containers on the same machine); the other runs Graylog 5.2 together with MongoDB 5.0.13. Both hosts have 8 logical Xeon cores, 32GB of RAM and SSD storage. All on AWS.
The user experience when querying is quite disappointing. First there is an inexplicable delay for “validating” the query clause, no matter how simple it is; using F12 in the browser, it’s about 700ms. We’d expect this to be almost instantaneous. Then the query executes, taking about 1700ms, even though clicking the “i” on the left shows that the query itself took less than 100ms. Finally there is a “fields” operation that takes about 1800ms. This operation asks for fields that existed in yesterday’s index but not in today’s (we cleaned up many unwanted fields yesterday).
Generally, we see the problem as being on the UI side rather than the OpenSearch side.
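To take the browser out of the equation, here is a rough Python sketch for timing the same REST calls directly. The base URL, token, and endpoint path are placeholders; they need to be replaced with whatever the F12 network tab shows for the validate/execute/fields requests.

```python
# Rough timing harness for the Graylog REST calls seen in the browser's network tab.
# Placeholders: GRAYLOG_URL, API_TOKEN and the endpoint path must be replaced with
# real values / the exact URL copied from the F12 network tab.
import time
import requests

GRAYLOG_URL = "http://graylog.example.com:9000"   # placeholder
API_TOKEN = "REPLACE_WITH_ACCESS_TOKEN"           # placeholder

def timed_get(path):
    """Issue one GET against the Graylog API and return (status code, elapsed ms)."""
    start = time.perf_counter()
    resp = requests.get(
        GRAYLOG_URL + path,
        auth=(API_TOKEN, "token"),  # access tokens go in basic auth, password literally "token"
        headers={"Accept": "application/json", "X-Requested-By": "timing-script"},
        timeout=30,
    )
    return resp.status_code, (time.perf_counter() - start) * 1000

# Paste the path of the slow request (e.g. the "fields" call) from the network tab:
status, ms = timed_get("/api/<path-copied-from-F12>")
print(f"HTTP {status} in {ms:.0f} ms")
```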
Our indices grow to about 4GB over a full day, and we have one index per day. The queries I’m running today only hit today’s index, which has a few hundred fields, whereas the earlier ones had several thousand fields because of “noise” that we eliminated yesterday. As it’s still early in the day, the current index has about 0.5 million entries and takes up 630MB. Quite small.
Each index has 4 shards. There are now about 150 indices in storage.
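For reference, a quick sketch (host and security settings are placeholders, assuming the OpenSearch HTTP port is reachable directly) that lists per-index primary shard count, document count and size via the _cat API:

```python
# List index name, primary shard count, doc count and store size via the _cat API.
# Assumption: OpenSearch is reachable on its HTTP port from where this runs;
# add auth/TLS settings if your cluster needs them.
import requests

OPENSEARCH_URL = "http://opensearch.example.com:9200"  # placeholder

resp = requests.get(
    OPENSEARCH_URL + "/_cat/indices",
    params={"h": "index,pri,docs.count,store.size", "s": "index", "format": "json"},
    timeout=30,
)
resp.raise_for_status()
for row in resp.json():
    print(row["index"], row["pri"], row["docs.count"], row["store.size"])
```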
Another Graylog server, running version 2.4, is a lot faster (!!!). On that server all components run on a single host, and that host has the same specs as each of our hosts, so it delivers much better performance with half the hardware.
My questions are:
Does version 5.2 have known performance issues in general?
What can we do to speed up the UI?
Why is Graylog asking for non-existent fields? Has it cached the fields it saw yesterday?
So you are pushing a total of 3GB a day through this cluster? What does it look like on the System > Nodes page? Ideally a screenshot of the details of each node, showing buffers, heap usage, etc.
No, generally we’ve heard that 5 is much faster than previous versions.
I hope this screenshot is what you asked for. Currently the busiest of our three log sources is in a low-traffic state, so things will get much busier in a few hours.
But let me stress again: it’s not ingestion that we have a problem with, it’s querying. And by “querying” I mean the Graylog side, not the OpenSearch side, as explained above.
You may want to look into your MongoDB performance. When you see slowness on the UI side of things and the system doesn’t seem overloaded by ingest, we have seen MongoDB performance be the problem. The first thing to look at there is the disk performance of the volume where the database is stored.
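If you want to confirm whether MongoDB itself is the slow part, something like the pymongo sketch below can help; the host and the database name (“graylog” is the usual default) are assumptions about your setup:

```python
# Spot slow MongoDB operations while you click around in the Graylog UI.
# Assumptions: pymongo installed, MongoDB reachable, Graylog DB named "graylog" (default).
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.example.com:27017")  # placeholder host
db = client["graylog"]

# Record every operation slower than 100 ms (profiling level 1).
db.command("profile", 1, slowms=100)

input("Run a few searches in the Graylog UI, then press Enter...")

# Show the ten slowest recorded operations: duration, type and namespace.
for op in db["system.profile"].find().sort("millis", -1).limit(10):
    print(op.get("millis"), op.get("op"), op.get("ns"))

# Switch profiling off again.
db.command("profile", 0)
```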
The answer was indeed in MongoDB. Evidently it maintains, for each index, a list of the fields it contains. We removed all indices prior to today’s, and querying became noticeably faster. Thankfully GL is not yet in production, so we had the luxury of doing that. The “validate” phase of the query also became much shorter.
The question we still have is the following:
During querying, there are two consecutive calls to the “fields” API. The first one returns the current list of fields in the index, while the second one returns a longer list that seems to have been assembled over time, and contains fields that no longer exist in the indices.
Do you know how we can manage this list, and perhaps trim it down?
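For context, this is roughly how we inspected the cached field data in MongoDB. The collection name “index_field_types” is an assumption on our side, so list the collections and verify before relying on it (and certainly before deleting anything):

```python
# Inspect how Graylog caches per-index field information in MongoDB (read-only).
# Assumptions: default database name "graylog"; the collection name
# "index_field_types" is a guess -- verify it via list_collection_names() first.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo.example.com:27017")  # placeholder host
db = client["graylog"]

# Which collections look index/field related, and how big are they?
for name in sorted(db.list_collection_names()):
    if "field" in name or "index" in name:
        print(name, db[name].estimated_document_count())

# Peek at a few documents of the candidate collection to see how entries map to indices.
for doc in db["index_field_types"].find().limit(5):
    print(sorted(doc.keys()))
```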
Is it possible to check performance with the heap set to 4GB? You only need to make it bigger than the default if errors indicate this; too much heap memory can slow down Java.
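To see what the OpenSearch nodes are actually configured with and using before and after changing -Xms/-Xmx, something like this works; the host is a placeholder and the cluster is assumed to be reachable without authentication:

```python
# Check configured and used JVM heap on each OpenSearch node via the nodes stats API.
# Assumption: the HTTP port is reachable and no authentication is required;
# adjust for your security setup.
import requests

OPENSEARCH_URL = "http://opensearch.example.com:9200"  # placeholder

stats = requests.get(OPENSEARCH_URL + "/_nodes/stats/jvm", timeout=30).json()
for node in stats["nodes"].values():
    mem = node["jvm"]["mem"]
    used_gb = mem["heap_used_in_bytes"] / 1024**3
    max_gb = mem["heap_max_in_bytes"] / 1024**3
    print(f"{node['name']}: {used_gb:.1f} GB used of {max_gb:.1f} GB max heap")
```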