Everything is down after 2.2.3 update

So, I saw the latest release (2.2.3) was available and updated to it. The update itself went fine and everything kept working afterwards, but a few hours later everything suddenly died and refuses to come back up.

I tried the usual fixes (restart the services, restart the machines).
I’m honestly not sure where to start, and I can’t make much sense of the logs.
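
For reference, the restarts were roughly the following (a sketch assuming systemd and the stock package service names, which may differ on your distro):

# restart the Graylog, Elasticsearch and MongoDB services on each node
sudo systemctl restart graylog-server
sudo systemctl restart elasticsearch
sudo systemctl restart mongod

# and, when that didn't help, a full reboot of both machines
sudo reboot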

Logs from both nodes (the primary-node excerpt is from Elasticsearch, the secondary-node one is from the Graylog server.log):

**PRIMARY NODE (elasticsearch.log)**
[2017-04-04 20:01:37,494][INFO ][node                     ] [node-1] stopping ...
[2017-04-04 20:01:37,657][INFO ][node                     ] [node-1] stopped
[2017-04-04 20:01:37,658][INFO ][node                     ] [node-1] closing ...
[2017-04-04 20:01:37,664][INFO ][node                     ] [node-1] closed
[2017-04-04 20:01:38,517][INFO ][node                     ] [node-1] version[2.4.4], pid[15625], build[fcbb46d/2017-01-03T11:33:16Z]
[2017-04-04 20:01:38,518][INFO ][node                     ] [node-1] initializing ...
[2017-04-04 20:01:39,155][INFO ][plugins                  ] [node-1] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2017-04-04 20:01:39,178][INFO ][env                      ] [node-1] using [1] data paths, mounts [[/ (/dev/sda2)]], net usable_space [2.4tb], net total_space [2.6tb], spins? [possibly], types [ext4]
[2017-04-04 20:01:39,178][INFO ][env                      ] [node-1] heap size [989.8mb], compressed ordinary object pointers [true]
[2017-04-04 20:01:41,123][INFO ][node                     ] [node-1] initialized
[2017-04-04 20:01:41,123][INFO ][node                     ] [node-1] starting ...
[2017-04-04 20:01:41,247][INFO ][transport                ] [node-1] publish_address {192.168.37.52:9300}, bound_addresses {127.0.1.1:9300}, {192.168.37.52:9300}
[2017-04-04 20:01:41,253][INFO ][discovery                ] [node-1] TEST-graylog/OEQt2xlyS3mOH4OCZSmNfw
[2017-04-04 20:02:11,256][WARN ][discovery                ] [node-1] waited for 30s and no initial state was set by the discovery
[2017-04-04 20:02:11,344][INFO ][http                     ] [node-1] publish_address {192.168.37.52:9200}, bound_addresses {127.0.1.1:9200}, {192.168.37.52:9200}
[2017-04-04 20:02:11,344][INFO ][node                     ] [node-1] started
[2017-04-04 20:02:19,555][INFO ][cluster.service          ] [node-1] detected_master {node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}, added {{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300},{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}])
[2017-04-04 20:02:44,312][INFO ][discovery.zen            ] [node-1] failed to send join request to master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2017-04-04 20:03:49,802][INFO ][cluster.service          ] [node-1] removed {{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}])
[2017-04-04 20:03:50,092][WARN ][transport                ] [node-1] Received response for a request that has timed out, sent [89537ms] ago, timed out [59537ms] ago, action [internal:discovery/zen/fd/master_ping], node [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], id [13]
[2017-04-04 20:05:02,563][INFO ][discovery.zen            ] [node-1] master_left [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], reason [shut_down]
[2017-04-04 20:05:02,572][WARN ][discovery.zen            ] [node-1] master left (reason = shut_down), current nodes: {{node-1}{OEQt2xlyS3mOH4OCZSmNfw}{192.168.37.52}{192.168.37.52:9300},}
[2017-04-04 20:05:02,578][INFO ][cluster.service          ] [node-1] removed {{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300},}, reason: zen-disco-master_failed ({node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300})
[2017-04-04 20:05:05,648][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [GREEN] to [RED] (reason: [nodes joined]).
[2017-04-04 20:05:05,649][INFO ][cluster.service          ] [node-1] new_master {node-1}{OEQt2xlyS3mOH4OCZSmNfw}{192.168.37.52}{192.168.37.52:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2017-04-04 20:05:05,651][INFO ][cluster.routing          ] [node-1] delaying allocation for [190] unassigned shards, next check in [1m]
[2017-04-04 20:05:15,527][INFO ][cluster.service          ] [node-1] added {{node-2}{gXWWdKRJT8O8iGZh-5Xcfg}{192.168.37.55}{192.168.37.55:9300},}, reason: zen-disco-join(pending joins after accumulation stop [election closed])
[2017-04-04 20:05:18,193][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[graylog_0][2], [graylog_0][0]] ...]).
[2017-04-04 20:06:24,181][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[session_login.cgi][3]] ...]).
[2017-04-04 20:07:20,470][INFO ][cluster.service          ] [node-1] added {{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-join(join from node[{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false}])

**SECONDARY NODE (graylog-server/server.log)**
2017-04-04T20:22:41.267-05:00 WARN  [ProxiedResource] Unable to call http://192.168.37.55:9000/api/system on node <c397e76f-56b3-4ec8-9f99-7e26f77f9f2c>
java.net.SocketTimeoutException: timeout
        at okio.Okio$4.newTimeoutException(Okio.java:227) ~[graylog.jar:?]
        at okio.AsyncTimeout.exit(AsyncTimeout.java:284) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:240) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:325) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:314) ~[graylog.jar:?]
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:210) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readResponse(Http1Codec.java:191) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:132) ~[graylog.jar:?]
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:54) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:59) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:179) ~[graylog.jar:?]
        at okhttp3.RealCall.execute(RealCall.java:63) ~[graylog.jar:?]
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:174) ~[graylog.jar:?]
        at org.graylog2.shared.rest.resources.ProxiedResource.lambda$null$0(ProxiedResource.java:76) ~[graylog.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_121]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_121]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_121]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_121]
        at okio.Okio$2.read(Okio.java:138) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:236) ~[graylog.jar:?]
        ... 29 more

Not sure what logs would be helpful at this point.
Any help is appreciated!

Please post the complete logs of your Graylog node.

Last 2000 lines of each log:

Node 1:
https://pastebin.com/JqT2TqQ4 (N1 mongodb.log)
https://pastebin.com/jF7t6ZuH (N1 elasticsearch.log)
https://pastebin.com/HPMnXRC1 (N1 graylog-server/server.log)

Node 2:
https://pastebin.com/e1wSfNLM (N2 elasticsearch.log)
https://pastebin.com/i2c8dj84 (N2 mongodb.log)
https://pastebin.com/EbdZkXbZ (Correct N2 server.log link)

If there is a better way to post the logs, please let me know. I didn’t want to post the WHOLE thing as they’re quite lengthy.
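
For reference, the excerpts were grabbed with tail against the default log locations (a sketch; the paths below are the stock package defaults and may differ on your install):

# last 2000 lines of each log, e.g. on node 1
# (Elasticsearch names its log file after the cluster name)
tail -n 2000 /var/log/graylog-server/server.log > n1-server.log
tail -n 2000 /var/log/elasticsearch/TEST-graylog.log > n1-elasticsearch.log
tail -n 2000 /var/log/mongodb/mongod.log > n1-mongodb.log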

There’s only one log from a Graylog node there, is that correct?

Additionally, please share your Graylog configuration files.

That’s my fault, I pasted the wrong link. There are two Graylog node logs.
The sixth link has been corrected.

server.conf (some info has been obscured)

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = xxxxxxxxxxxxxx
root_username = XXXX
root_password_sha2 = xxxxxxxxx
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://192.168.37.52:9000/api/
#rest_transport_uri = http://192.168.1.1:9000/api/
web_listen_uri = http://192.168.37.52:9000/
#elasticsearch_config_file = /etc/graylog/server/elasticsearch.yml
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
#elasticsearch_max_size_per_index = 1073741824
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = XXXX-graylog
elasticsearch_node_name_prefix = XXXX-
elasticsearch_discovery_zen_ping_unicast_hosts = GraySrv1:9300, GraySrv2:9300
#elasticsearch_node_master = true
#elasticsearch_node_data = true
elasticsearch_cluster_discovery_timeout = 50000
elasticsearch_network_host = 192.168.37.52
elasticsearch_network_bind_host = 192.168.37.52
elasticsearch_network_publish_host = 192.168.37.52
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true

message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3

stale_master_timeout = 20000

mongodb_uri = mongodb://GraySrv1:27017,GraySrv2:27017,GrayArb:27017/graylog?replicaSet=rs0
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5

transport_email_enabled = true
transport_email_hostname = smtp.XXXXXXX.com
transport_email_port = 25
transport_email_use_auth = false
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_auth_username = graylog@XXXXXXX.com
#transport_email_auth_password = secret
#transport_email_subject_prefix = [graylog]
transport_email_from_email = graylog@XXXXXXXXX.com
transport_email_web_interface_url = http://graylog.XXXXXXXX.com
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

server.conf is the same on both nodes, apart from the addressing differences.
EDIT: The other notable difference is that one node is set to is_master = true and the other to is_master = false.
I had no issues with this configuration until the incident. I’ve also been running a virtual cluster in exactly the same setup (a test to make sure everything worked correctly before reformatting the hardware stack) and have had no issues with it.
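
If it helps, the comparison between the two configs was essentially this (hypothetical one-liner, assuming SSH access from node 1 to node 2):

# only the listen/network addresses and is_master should differ
diff /etc/graylog/server/server.conf <(ssh GraySrv2 cat /etc/graylog/server/server.conf)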

I’m wondering if it may be an issue between Java and Elasticsearch.
I’ve been trying to find out whether it’s caused by too little memory being allocated to the Java heap, and if so, how to adjust that properly.
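
In case I'm looking in the wrong place: as far as I can tell (assuming the standard DEB/RPM packages; file locations differ per distro), the heap sizes are set via environment files rather than server.conf:

# Graylog heap: GRAYLOG_SERVER_JAVA_OPTS in /etc/default/graylog-server
# (or /etc/sysconfig/graylog-server); keep the GC flags that ship with the package
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g <existing GC flags>"

# Elasticsearch 2.x heap: ES_HEAP_SIZE in /etc/default/elasticsearch
# (or /etc/sysconfig/elasticsearch); restart the services afterwards
ES_HEAP_SIZE=4g

The 2g/4g values above are just illustrative.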

EDIT2: I increased the heap size to 10g for Graylog, which didn’t resolve the issue. I did find out that the process is maxing out its memory usage.

Also, some of the ES indices (according to what little I can see in the web UI) are showing as red. I checked their status using curl, and they all come back green (on both nodes). I’m still not sure what the issue is, but it has something to do with Graylog eating all of its available memory.
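
For reference, the checks were along these lines (the ES endpoints are the standard 2.x health APIs; the Graylog call is the same /api/system URL that times out in the server.log above):

# Elasticsearch cluster and per-index health, run against either node
curl -s 'http://192.168.37.52:9200/_cluster/health?pretty'
curl -s 'http://192.168.37.52:9200/_cat/indices?v'

# Graylog node API (needs the web interface credentials via -u)
curl -s -u youradmin http://192.168.37.52:9000/api/system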