Trouble upgrading from 3 to 4

Hello,

I have a problem upgrading from 3 to the latest 4 on an 8-node cluster.

graylog-server-3.3.13-1.noarch
mongo 4.2
es 6.8

After the upgrade it seems to work fine, but after a short time it hangs. Logs seem to be rolling fine, but the GUI is unresponsive.

I get some errors like:

Fielddata is disabled on text fields by default. Set fielddata=true on [message] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.
ElasticsearchException{message=Search type returned error:

Fielddata is disabled on text fields by default. Set fielddata=true on [message] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead., errorDetails=[Fielddata is disabled on text fields by default.
Set fielddata=true on [message] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory. Alternatively use a keyword field instead.]}
at org.graylog.storage.elasticsearch6.jest.JestUtils.specificException(JestUtils.java:122)
at org.graylog.storage.elasticsearch6.views.ElasticsearchBackend.doRun(ElasticsearchBackend.java:255)
at org.graylog.storage.elasticsearch6.views.ElasticsearchBackend.doRun(ElasticsearchBackend.java:69)
at org.graylog.plugins.views.search.engine.QueryBackend.run(QueryBackend.java:83)
at org.graylog.plugins.views.search.engine.QueryEngine.prepareAndRun(QueryEngine.java:164)
at org.graylog.plugins.views.search.engine.QueryEngine.lambda$execute$6(QueryEngine.java:104)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

And timeouts:

2021-06-18T23:48:55.458+02:00 WARN [ProxiedResource] Unable to call https://1.graylog.my.domain:9000/api/system/metrics/multiple on node <5f31a2e0-320b-4a06-83e3-ebe3ba806102>: timeout
java.util.concurrent.TimeoutException: null

Endpoint config, same on all eight nodes:
http_bind_address = 1.graylog.my.domain:9000
http_external_uri = https://graylog.my.domain/
http_enable_cors = true

Hello,
Maybe I can help.

It looks like you're using HTTPS, and the log files show that it cannot connect to your HTTPS URL. What have you checked so far to resolve this issue?
Could you show your server.conf file?

For example, here is mine; maybe it helps:

http_bind_address = graylog.domain.net:9000
http_publish_uri = https://graylog.domain.net:9000/
http_enable_cors = true
http_enable_tls = true
http_tls_cert_file = /etc/ssl/certs/graylog/graylog-certificate.pem
http_tls_key_file = /etc/ssl/certs/graylog/graylog-key.pem
http_tls_key_password = secret
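One quick way to confirm the TLS endpoint itself is reachable (hostname here is from my example config above, adjust to yours):

```shell
# Check that the Graylog HTTPS endpoint answers at all.
# -k skips certificate verification; drop it once your CA chain is in place.
curl -sk --max-time 5 https://graylog.domain.net:9000/api/system/lbstatus
# A healthy node answers with: ALIVE
```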

Hey, thanks for your answer. The problem is, everything works fine for some time, and after that it looks like some thread pool fills up or something similar.

Hello,

Actually, that sounds familiar. I had to adjust my configuration file to match my CPU cores.

Example: I have a server with 12 cores, so I configured my Graylog server.conf file as follows:

processbuffer_processors = 6
outputbuffer_processors = 3
inputbuffer_processors = 2

I left one core for the system. I'm not 100% sure that will fix your issue.
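For what it's worth, the split above follows a simple rule of thumb of mine (not an official Graylog formula):

```shell
# Rough split of buffer processors across CPU cores: half the cores to
# the process buffer, a quarter to output, and the rest split between
# the input buffer and the OS. My own rule of thumb, nothing official.
CORES=12                       # the 12-core example above
PROC=$((CORES / 2))            # process buffer: 6
OUT=$((CORES / 4))             # output buffer:  3
IN=$((CORES / 6))              # input buffer:   2
echo "processbuffer_processors = $PROC"
echo "outputbuffer_processors = $OUT"
echo "inputbuffer_processors = $IN"
```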


I’d be keen to know more about your cluster. Specifically:

  • How many ES nodes?
  • How much heap per node?
  • How many shards?
  • What are the shard sizes?

And along with that, for Graylog:

  • How many shards do you have configured in your index sets?
  • What does the retention look like?
    • What retention strategy are you using?
    • How many indices are you keeping?

There has been a rash of issues I’ve seen lately where a poorly configured Graylog leads to ES issues, which then manifest back in Graylog.
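If it helps, most of these numbers can be pulled straight from Elasticsearch's _cat APIs (assuming ES is listening on localhost:9200; adjust the host to your cluster):

```shell
# Nodes and their configured heap
curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent'
# Shard count and per-shard sizes (head to keep the output manageable)
curl -s 'localhost:9200/_cat/shards?v&h=index,shard,prirep,store' | head
# Indices with document counts and on-disk sizes
curl -s 'localhost:9200/_cat/indices?v&h=index,docs.count,store.size'
```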


@aaronsachs

  • 8 ES nodes
  • 50 GB RAM, 25GB heap
  • ca. 10,500 shards
  • 20 GB per index / 8 shards
  • 1200 indices

Default index set: 1,201 indices, 18,650,397,348 documents, 18.3TiB

The Graylog default index set. Graylog will use this index set by default.

Index prefix:
graylog
Shards:
8
Replicas:
0
Field type refresh interval:
5 seconds

Index rotation strategy:
Index Size
Max index size:
21474836480 bytes (20.0GiB)

Index retention strategy:
Delete
Max number of indices:
1200

@gsmith

14 cores

processbuffer_processors = 9
outputbuffer_processors = 3
inputbuffer_processors = 38

I thought they were just Java threads waiting for CPU.

So going off the info you provided, I’d say your cluster is over-sharded.

Elasticsearch has a hard-coded limit of around 10k shards per cluster, IIRC. If you follow their recommendations for sharding, you should have at most 20 shards per GB of cluster heap. So at that shard count, you’d need roughly 500GB of heap to support your deployment.
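The arithmetic, using the numbers you posted above:

```shell
# Heap needed vs. heap available, per the ~20 shards per GB of heap
# recommendation. Shard and heap figures are taken from the posts above.
SHARDS=10500
RATIO=20                          # recommended max shards per GB of heap
HEAP_PER_NODE=25                  # GB per ES node
NODES=8
NEEDED=$((SHARDS / RATIO))        # heap the shard count calls for
AVAILABLE=$((HEAP_PER_NODE * NODES))
echo "heap needed:    ${NEEDED} GB"
echo "heap available: ${AVAILABLE} GB"
```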

The behavior you mentioned here:

Would be best explained by your heap being freed when you restart, and filling back up as index operations start to pick up again.

My advice is to do some tuning on your cluster. If you’re capping index sets off at 20GB, then 8 shards is excessive. I’d cap my index set size at something like 120-160GB and drop my shards to 4. You can change that now, but you’ll still have to start archiving indices to bring the number of shards down. You could try shrinking the indices using the _shrink API in Elasticsearch, but at that number of shards, you won’t have any free for the reindex operation that has to occur.
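For reference, a shrink looks roughly like this (index and node names here are placeholders); the source index has to be made read-only and all its shards relocated onto a single node first:

```shell
# 1. Block writes and move every shard of the source index to one node
curl -s -XPUT 'localhost:9200/graylog_0/_settings' \
  -H 'Content-Type: application/json' -d '{
    "settings": {
      "index.blocks.write": true,
      "index.routing.allocation.require._name": "shrink-node-1"
    }
  }'
# 2. Shrink into a new index with fewer primary shards
curl -s -XPOST 'localhost:9200/graylog_0/_shrink/graylog_0_shrunk' \
  -H 'Content-Type: application/json' -d '{
    "settings": { "index.number_of_shards": 1 }
  }'
```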

Your best bet is to get rid of old data you no longer need.


@aaronsachs

But I didn’t change anything on Elasticsearch when upgrading Graylog from 3 to 4. What’s more, I see indexing going on in the logs while not being able to access the Graylog GUI. Restarting the Graylog server frees heap on Elasticsearch? I don’t get it.

I genuinely don’t think that this is a version issue. Your deployment is clearly oversharded and doesn’t have enough resources to support the amount of indices and shards that you have. I’m confident that given time, you’d probably see the same issues start to creep up in 3 as well. You must:

  1. Decrease the number of shards by deleting indices and not setting such a high shard count
  2. Give the system more resources–the best practice here is to scale Elasticsearch out, rather than up, so adding more Elasticsearch nodes would be more beneficial than trying to provision more RAM/heap for it.

@aaronsachs

It has worked for years on Graylog 3 without such issues. Version 4 goes unresponsive after a few minutes. The Elasticsearch cluster is green and working when viewed from Kibana during the Graylog failure. I am aware the ES side of this setup is too weak. By unresponsive GUI I do not mean hanging searches or things like that; I cannot even get the login page.

I get the behavior you’re describing. But the numbers you provided don’t lie–you do not have enough resources for your deployment. There’s no getting around that. I’m actually amazed that your deployment didn’t fall over on version 3.

Let me put it this way: you’re describing a symptom of a problem. Just because that symptom didn’t appear under version 3 doesn’t mean there isn’t a problem, and I’m telling you candidly and honestly, you have a problem with too many shards and not enough resources. What you do with the advice I’ve provided is up to you, but I guarantee that if you don’t address the underlying problem, there will continue to be issues.


@aaronsachs

Changed to

Index prefix:
graylog
Shards:
4
Replicas:
0
Field type refresh interval:
5 seconds

Index rotation strategy:
Index Size
Max index size:
160000000000 bytes (149.0GiB)

Index retention strategy:
Delete
Max number of indices:
160


Excellent. IIRC rotating the active write index should kick Graylog and make it delete any indices over the configured retention. That would save you the trouble of having to do so manually.
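For anyone finding this later: if I remember the route correctly, the rotation can also be triggered through the REST API (URL and credentials below are placeholders; the X-Requested-By header is required by Graylog’s CSRF protection):

```shell
# Manually rotate the active write index of the default index set
curl -s -X POST \
  -u admin:yourpassword \
  -H 'X-Requested-By: cli' \
  'https://graylog.my.domain/api/system/deflector/cycle'
```

On Graylog 4 there is also a per-index-set variant of this endpoint that takes the index set ID in the path.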