Hey @jan,
thanks for taking the time to look into this!
See below for an excerpt from the config.
The 2nd Graylog node is configured almost identically except that it publishes its own IP/port.
http_bind_address = 0.0.0.0:9000
### This is the IP of the Graylog node. Each node publishes its own IP.
http_publish_uri = http://<redacted>.48:9000/
### Leftover from some experimentation. It was not set before (and apparently makes no difference for this issue)
http_thread_pool_size = 32
### Outbound (internet) is only possible via the proxy
http_proxy_uri = http://<redacted>:<redacted>@proxy.<redacted>.local:8080
### <redacted>.48 <- This node
### <redacted>.6 <- 2nd GL node
### <redacted>.204 <- Elasticsearch server
http_non_proxy_hosts = 127.0.0.1,<redacted>.48,<redacted>.6,<redacted>.204
I believe the endpoints are configured correctly, since I can see the communication happening.
It is just that ~95% of the time the request for the metrics results in a timeout.
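To put a firmer number on that ~95%, something like the following could be run from one node against the other. This is only a rough sketch under a few assumptions: `timeout_rate`, `make_http_probe` and `is_timeout` are names made up for illustration, the 5-second timeout is arbitrary, and the probe talks to the target directly (bypassing the proxy), so it does not reproduce exactly what Graylog's own HTTP client does.

```python
import socket
import urllib.error
import urllib.request
from typing import Callable


def is_timeout(exc: BaseException) -> bool:
    """True if exc is a socket timeout, raised directly or wrapped in URLError."""
    if isinstance(exc, socket.timeout):
        return True
    return isinstance(exc, urllib.error.URLError) and isinstance(
        exc.reason, socket.timeout
    )


def timeout_rate(probe: Callable[[], None], attempts: int = 20) -> float:
    """Call probe() `attempts` times and return the fraction that timed out."""
    timeouts = 0
    for _ in range(attempts):
        try:
            probe()
        except Exception as exc:
            if not is_timeout(exc):
                raise  # anything other than a timeout is a different problem
            timeouts += 1
    return timeouts / attempts


def make_http_probe(url: str, timeout_s: float = 5.0) -> Callable[[], None]:
    """Build a probe that fetches url directly, ignoring any proxy settings."""
    # An empty ProxyHandler disables proxy lookup from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

    def probe() -> None:
        try:
            with opener.open(url, timeout=timeout_s) as resp:
                resp.read()
        except urllib.error.HTTPError:
            pass  # e.g. 401 without credentials: the node answered, so no timeout

    return probe
```

Usage would be along the lines of `timeout_rate(make_http_probe("http://<other-node>:9000/api/system/metrics"))`, where the host is a placeholder and the `/api/system/metrics` path is my assumption about which endpoint to hit. A result near 0.95 would match what I see in the UI, while a result near 0 would suggest the problem is in how the nodes talk to each other rather than in the network path itself.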