I saw that the latest release was available and upgraded to it. The upgrade itself went fine and everything kept working, but a few hours later everything suddenly died and refuses to come back up.
I've tried the usual fixes (restarting the services, restarting the machines).
I’m honestly not sure where to start, and I can’t make much sense of the logs.
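To at least see where things stand right now, here's a quick probe I can run against the two endpoints that show up in the logs below. This is a minimal sketch: the addresses and ports are taken from the log lines, and `admin`/`password` are placeholder credentials for the Graylog API (it requires authentication), so adjust to your setup.

```python
# Quick health probe for the endpoints that appear in the logs below.
# Assumptions: Elasticsearch HTTP on 192.168.37.52:9200 and the Graylog REST API
# on 192.168.37.55:9000/api; 'admin'/'password' are placeholders.
import base64
import json
import urllib.request


def fetch(url, user=None, password=None, timeout=10):
    """GET a URL and return its decoded JSON body, or the error as a string."""
    req = urllib.request.Request(url)
    if user is not None:
        token = base64.b64encode(f"{user}:{password}".encode()).decode()
        req.add_header("Authorization", f"Basic {token}")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return json.loads(resp.read().decode())
    except Exception as exc:  # connection refused, timeout, HTTP error, ...
        return f"FAILED: {exc}"


# Elasticsearch cluster health: 'status' and 'number_of_nodes' are the interesting fields.
print(fetch("http://192.168.37.52:9200/_cluster/health"))

# Graylog REST API on the node the SocketTimeoutException below points at.
print(fetch("http://192.168.37.55:9000/api/system", user="admin", password="password"))
```

On a healthy setup I'd expect the first call to report status green with both data nodes plus the Graylog client node present, and the second call to return JSON instead of timing out the way it does in the stack trace below.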
Elasticsearch log
**PRIMARY NODE**
[2017-04-04 20:01:37,494][INFO ][node ] [node-1] stopping ...
[2017-04-04 20:01:37,657][INFO ][node ] [node-1] stopped
[2017-04-04 20:01:37,658][INFO ][node ] [node-1] closing ...
[2017-04-04 20:01:37,664][INFO ][node ] [node-1] closed
[2017-04-04 20:01:38,517][INFO ][node ] [node-1] version[2.4.4], pid[15625], build[fcbb46d/2017-01-03T11:33:16Z]
[2017-04-04 20:01:38,518][INFO ][node ] [node-1] initializing ...
[2017-04-04 20:01:39,155][INFO ][plugins ] [node-1] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2017-04-04 20:01:39,178][INFO ][env ] [node-1] using [1] data paths, mounts [[/ (/dev/sda2)]], net usable_space [2.4tb], net total_space [2.6tb], spins? [possibly], types [ext4]
[2017-04-04 20:01:39,178][INFO ][env ] [node-1] heap size [989.8mb], compressed ordinary object pointers [true]
[2017-04-04 20:01:41,123][INFO ][node ] [node-1] initialized
[2017-04-04 20:01:41,123][INFO ][node ] [node-1] starting ...
[2017-04-04 20:01:41,247][INFO ][transport ] [node-1] publish_address {192.168.37.52:9300}, bound_addresses {127.0.1.1:9300}, {192.168.37.52:9300}
[2017-04-04 20:01:41,253][INFO ][discovery ] [node-1] TEST-graylog/OEQt2xlyS3mOH4OCZSmNfw
[2017-04-04 20:02:11,256][WARN ][discovery ] [node-1] waited for 30s and no initial state was set by the discovery
[2017-04-04 20:02:11,344][INFO ][http ] [node-1] publish_address {192.168.37.52:9200}, bound_addresses {127.0.1.1:9200}, {192.168.37.52:9200}
[2017-04-04 20:02:11,344][INFO ][node ] [node-1] started
[2017-04-04 20:02:19,555][INFO ][cluster.service ] [node-1] detected_master {node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}, added {{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300},{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}])
[2017-04-04 20:02:44,312][INFO ][discovery.zen ] [node-1] failed to send join request to master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2017-04-04 20:03:49,802][INFO ][cluster.service ] [node-1] removed {{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}])
[2017-04-04 20:03:50,092][WARN ][transport ] [node-1] Received response for a request that has timed out, sent [89537ms] ago, timed out [59537ms] ago, action [internal:discovery/zen/fd/master_ping], node [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], id [13]
[2017-04-04 20:05:02,563][INFO ][discovery.zen ] [node-1] master_left [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], reason [shut_down]
[2017-04-04 20:05:02,572][WARN ][discovery.zen ] [node-1] master left (reason = shut_down), current nodes: {{node-1}{OEQt2xlyS3mOH4OCZSmNfw}{192.168.37.52}{192.168.37.52:9300},}
[2017-04-04 20:05:02,578][INFO ][cluster.service ] [node-1] removed {{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300},}, reason: zen-disco-master_failed ({node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300})
[2017-04-04 20:05:05,648][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [GREEN] to [RED] (reason: [nodes joined]).
[2017-04-04 20:05:05,649][INFO ][cluster.service ] [node-1] new_master {node-1}{OEQt2xlyS3mOH4OCZSmNfw}{192.168.37.52}{192.168.37.52:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2017-04-04 20:05:05,651][INFO ][cluster.routing ] [node-1] delaying allocation for [190] unassigned shards, next check in [1m]
[2017-04-04 20:05:15,527][INFO ][cluster.service ] [node-1] added {{node-2}{gXWWdKRJT8O8iGZh-5Xcfg}{192.168.37.55}{192.168.37.55:9300},}, reason: zen-disco-join(pending joins after accumulation stop [election closed])
[2017-04-04 20:05:18,193][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[graylog_0][2], [graylog_0][0]] ...]).
[2017-04-04 20:06:24,181][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[session_login.cgi][3]] ...]).
[2017-04-04 20:07:20,470][INFO ][cluster.service ] [node-1] added {{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-join(join from node[{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false}])
Graylog server.log
**SECONDARY NODE**
2017-04-04T20:22:41.267-05:00 WARN [ProxiedResource] Unable to call http://192.168.37.55:9000/api/system on node <c397e76f-56b3-4ec8-9f99-7e26f77f9f2c>
java.net.SocketTimeoutException: timeout
at okio.Okio$4.newTimeoutException(Okio.java:227) ~[graylog.jar:?]
at okio.AsyncTimeout.exit(AsyncTimeout.java:284) ~[graylog.jar:?]
at okio.AsyncTimeout$2.read(AsyncTimeout.java:240) ~[graylog.jar:?]
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:325) ~[graylog.jar:?]
at okio.RealBufferedSource.indexOf(RealBufferedSource.java:314) ~[graylog.jar:?]
at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:210) ~[graylog.jar:?]
at okhttp3.internal.http1.Http1Codec.readResponse(Http1Codec.java:191) ~[graylog.jar:?]
at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:132) ~[graylog.jar:?]
at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:54) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:59) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:179) ~[graylog.jar:?]
at okhttp3.RealCall.execute(RealCall.java:63) ~[graylog.jar:?]
at retrofit2.OkHttpCall.execute(OkHttpCall.java:174) ~[graylog.jar:?]
at org.graylog2.shared.rest.resources.ProxiedResource.lambda$null$0(ProxiedResource.java:76) ~[graylog.jar:?]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_121]
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_121]
at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_121]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_121]
at okio.Okio$2.read(Okio.java:138) ~[graylog.jar:?]
at okio.AsyncTimeout$2.read(AsyncTimeout.java:236) ~[graylog.jar:?]
... 29 more
I'm not sure which other logs would be helpful at this point, but I can pull whatever is needed.
Any help is appreciated!