I have a problem upgrading from Graylog 3 to the latest 4 on an 8-node cluster.
graylog-server-3.3.13-1.noarch
MongoDB 4.2
Elasticsearch 6.8
After the upgrade everything seems to work fine, but after a short time it hangs. Logs appear to keep rolling in, but the GUI becomes unresponsive.
I get some errors like:
Fielddata is disabled on text fields by default. Set fielddata=true on [message] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
ElasticsearchException{message=Search type returned error:
Fielddata is disabled on text fields by default. Set fielddata=true on [message] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead., errorDetails=[Fielddata is disabled on text fields by default.
Set fielddata=true on [message] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.]}
at org.graylog.storage.elasticsearch6.jest.JestUtils.specificException(JestUtils.java:122)
at org.graylog.storage.elasticsearch6.views.ElasticsearchBackend.doRun(ElasticsearchBackend.java:255)
at org.graylog.storage.elasticsearch6.views.ElasticsearchBackend.doRun(ElasticsearchBackend.java:69)
at org.graylog.plugins.views.search.engine.QueryBackend.run(QueryBackend.java:83)
at org.graylog.plugins.views.search.engine.QueryEngine.prepareAndRun(QueryEngine.java:164)
at org.graylog.plugins.views.search.engine.QueryEngine.lambda$execute$6(QueryEngine.java:104)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
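The error points at aggregating or sorting on the message field, which is mapped as text. Checking the actual mapping on the Elasticsearch side looks something like this (assuming ES on localhost:9200 and the default graylog_ index prefix; adjust to your setup):

```
# Check how the "message" field is mapped in the Graylog indices.
curl -s 'http://localhost:9200/graylog_*/_mapping/field/message?pretty'

# A plain "text" mapping with fielddata disabled (the default) is exactly what
# produces this error when a search tries to aggregate or sort on the field.
```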
Looks like you're using HTTPS, and the log files show that it cannot connect to your HTTPS URL. What have you checked so far to resolve this issue?
Could you show your server.conf file?
As an example, here is mine; maybe it helps.
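If you want to paste yours, something along these lines should dump just the active settings with the secrets redacted first (the path assumes a standard RPM/DEB package install; adjust if yours differs):

```
# Print only the active (non-comment) settings from the default package location,
# with the secrets redacted so the output is safe to post.
grep -Ev '^[[:space:]]*(#|$)' /etc/graylog/server/server.conf \
  | sed -E 's/^(password_secret|root_password_sha2)[[:space:]]*=.*/\1 = <redacted>/'
```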
So going off the info you provided, I’d say your cluster is over-sharded.
Elasticsearch has a hard-coded limit of around 10k shards, IIRC. If you follow their sharding recommendations, you should have at most about a 20:1 ratio of shards to GB of cluster heap (roughly 20 shards per GB). So at that number of shards, you'd need roughly 500GB of heap to support your deployment.
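If you want to sanity-check those numbers against your own cluster (assuming ES is reachable on localhost:9200), the cluster health and _cat/nodes APIs give you the shard count and per-node heap:

```
# Total shard count reported by the cluster
curl -s 'http://localhost:9200/_cluster/health?pretty' | grep -E '"active_(primary_)?shards"'

# Configured heap per Elasticsearch node
curl -s 'http://localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'

# Rule of thumb: total shards / 20 is roughly the GB of heap the cluster needs.
```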
The behavior you mentioned (working for a while after a restart, then hanging) would best be explained by the heap being freed when you restart, until indexing operations pick back up.
My advice is to do some tuning on your cluster. If you're capping index sets at 20GB, then 8 shards per index is excessive. I'd cap the index set size at something like 120-160GB and drop the shard count to 4. You can change that now, but you'll still have to start archiving indices to bring the total shard count down. You could try shrinking the indices with the _shrink API in Elasticsearch, but with that many shards you won't have any headroom left for the reindex operation that has to occur.
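If you do want to go the shrink route, the rough Elasticsearch 6.8 workflow looks something like the sketch below; the index name graylog_42 and the node name es-node-1 are placeholders, and I'd try it on a single old index first:

```
# 1. Move all copies of the index onto a single node and block writes.
curl -s -X PUT 'http://localhost:9200/graylog_42/_settings' \
  -H 'Content-Type: application/json' -d '{
    "settings": {
      "index.routing.allocation.require._name": "es-node-1",
      "index.blocks.write": true
    }
  }'

# 2. Once relocation is done and the cluster is green, shrink it.
#    The target shard count must be a factor of the source count (8 -> 4 here).
curl -s -X POST 'http://localhost:9200/graylog_42/_shrink/graylog_42_shrunk' \
  -H 'Content-Type: application/json' -d '{
    "settings": {
      "index.number_of_shards": 4,
      "index.number_of_replicas": 0
    }
  }'
```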
Your best bet is to get rid of old data you no longer need.
But I didn't change anything on Elasticsearch when upgrading Graylog from 3 to 4. What's more, I can see indexing happening in the logs while I'm unable to access the Graylog GUI. And restarting the Graylog server frees heap on Elasticsearch? I don't get it.
I genuinely don't think this is a version issue. Your deployment is clearly over-sharded and doesn't have enough resources to support the number of indices and shards you have. I'm confident that, given time, you'd see the same issues creep up on 3 as well. You must:
- Decrease the number of shards, by deleting old indices and by not configuring such a high shard count per index.
- Give the system more resources. The best practice here is to scale Elasticsearch out rather than up, so adding more Elasticsearch nodes will do more good than trying to provision more RAM/heap for the existing ones.
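To see where you currently stand, _cat/allocation gives a quick per-node picture of shard counts and disk headroom (again assuming ES on localhost:9200):

```
# Shards, disk usage, and headroom per Elasticsearch data node
curl -s 'http://localhost:9200/_cat/allocation?v'

# Total number of shards (primaries + replicas)
curl -s 'http://localhost:9200/_cat/shards' | wc -l
```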
It has worked for years on Graylog 3 without such issues. Version 4 becomes unresponsive after a few minutes. The Elasticsearch cluster is green and working when viewed from Kibana during the Graylog failure. I'm aware that ES is undersized in this setup. By an unresponsive GUI I don't mean hanging searches or anything like that: I cannot even get the login page.
I understand the behavior you're describing. But the numbers you provided don't lie: you do not have enough resources for your deployment, and there's no getting around that. I'm actually amazed your deployment didn't fall over on version 3.
Let me put it this way: you're describing a symptom of a problem. Just because that symptom didn't show up under version 3 doesn't mean the problem isn't there, and I'm telling you candidly and honestly, you have too many shards and not enough resources. What you do with the advice I've provided is up to you, but I guarantee that if you don't address the underlying problem, there will continue to be issues.
Excellent. IIRC, rotating the active write index should kick Graylog into deleting any indices beyond the configured retention, which would save you the trouble of doing it manually.
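After rotating (it's under System > Indices, in the index set's Maintenance menu, if I remember the UI right), listing the indices on the Elasticsearch side is an easy way to confirm that retention actually removed the oldest ones (assuming the default graylog_ prefix):

```
# List Graylog-managed indices sorted by name, with doc counts and sizes
curl -s 'http://localhost:9200/_cat/indices/graylog_*?v&s=index'
```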