Everything is down after 2.2.3 update

So, I saw the latest release (2.2.3) was available and updated to it. The update itself went fine and everything kept working afterwards, but a few hours later everything suddenly died and refuses to come back up.

I tried the usual fixes (restart the services, restart the machines).
I’m honestly not sure where to start, and I can’t make much sense of the logs.
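
For reference, the restarts were roughly the following (a sketch assuming systemd and the stock package service names, which may differ on your distro):

# restart the Graylog, Elasticsearch and MongoDB services on each node
sudo systemctl restart graylog-server
sudo systemctl restart elasticsearch
sudo systemctl restart mongod

# and, when that didn't help, a full reboot of both machines
sudo reboot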

Logs from both nodes (the primary-node excerpt is from Elasticsearch, the secondary-node one is from the Graylog server.log):

**PRIMARY NODE (elasticsearch.log)**
[2017-04-04 20:01:37,494][INFO ][node                     ] [node-1] stopping ...
[2017-04-04 20:01:37,657][INFO ][node                     ] [node-1] stopped
[2017-04-04 20:01:37,658][INFO ][node                     ] [node-1] closing ...
[2017-04-04 20:01:37,664][INFO ][node                     ] [node-1] closed
[2017-04-04 20:01:38,517][INFO ][node                     ] [node-1] version[2.4.4], pid[15625], build[fcbb46d/2017-01-03T11:33:16Z]
[2017-04-04 20:01:38,518][INFO ][node                     ] [node-1] initializing ...
[2017-04-04 20:01:39,155][INFO ][plugins                  ] [node-1] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
[2017-04-04 20:01:39,178][INFO ][env                      ] [node-1] using [1] data paths, mounts [[/ (/dev/sda2)]], net usable_space [2.4tb], net total_space [2.6tb], spins? [possibly], types [ext4]
[2017-04-04 20:01:39,178][INFO ][env                      ] [node-1] heap size [989.8mb], compressed ordinary object pointers [true]
[2017-04-04 20:01:41,123][INFO ][node                     ] [node-1] initialized
[2017-04-04 20:01:41,123][INFO ][node                     ] [node-1] starting ...
[2017-04-04 20:01:41,247][INFO ][transport                ] [node-1] publish_address {192.168.37.52:9300}, bound_addresses {127.0.1.1:9300}, {192.168.37.52:9300}
[2017-04-04 20:01:41,253][INFO ][discovery                ] [node-1] TEST-graylog/OEQt2xlyS3mOH4OCZSmNfw
[2017-04-04 20:02:11,256][WARN ][discovery                ] [node-1] waited for 30s and no initial state was set by the discovery
[2017-04-04 20:02:11,344][INFO ][http                     ] [node-1] publish_address {192.168.37.52:9200}, bound_addresses {127.0.1.1:9200}, {192.168.37.52:9200}
[2017-04-04 20:02:11,344][INFO ][node                     ] [node-1] started
[2017-04-04 20:02:19,555][INFO ][cluster.service          ] [node-1] detected_master {node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}, added {{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300},{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}])
[2017-04-04 20:02:44,312][INFO ][discovery.zen            ] [node-1] failed to send join request to master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], reason [ElasticsearchTimeoutException[Timeout waiting for task.]]
[2017-04-04 20:03:49,802][INFO ][cluster.service          ] [node-1] removed {{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-receive(from master [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}])
[2017-04-04 20:03:50,092][WARN ][transport                ] [node-1] Received response for a request that has timed out, sent [89537ms] ago, timed out [59537ms] ago, action [internal:discovery/zen/fd/master_ping], node [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], id [13]
[2017-04-04 20:05:02,563][INFO ][discovery.zen            ] [node-1] master_left [{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300}], reason [shut_down]
[2017-04-04 20:05:02,572][WARN ][discovery.zen            ] [node-1] master left (reason = shut_down), current nodes: {{node-1}{OEQt2xlyS3mOH4OCZSmNfw}{192.168.37.52}{192.168.37.52:9300},}
[2017-04-04 20:05:02,578][INFO ][cluster.service          ] [node-1] removed {{node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300},}, reason: zen-disco-master_failed ({node-2}{xQKVr-jxQZGWodwP0hf8iw}{192.168.37.55}{192.168.37.55:9300})
[2017-04-04 20:05:05,648][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [GREEN] to [RED] (reason: [nodes joined]).
[2017-04-04 20:05:05,649][INFO ][cluster.service          ] [node-1] new_master {node-1}{OEQt2xlyS3mOH4OCZSmNfw}{192.168.37.52}{192.168.37.52:9300}, reason: zen-disco-join(elected_as_master, [0] joins received)
[2017-04-04 20:05:05,651][INFO ][cluster.routing          ] [node-1] delaying allocation for [190] unassigned shards, next check in [1m]
[2017-04-04 20:05:15,527][INFO ][cluster.service          ] [node-1] added {{node-2}{gXWWdKRJT8O8iGZh-5Xcfg}{192.168.37.55}{192.168.37.55:9300},}, reason: zen-disco-join(pending joins after accumulation stop [election closed])
[2017-04-04 20:05:18,193][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[graylog_0][2], [graylog_0][0]] ...]).
[2017-04-04 20:06:24,181][INFO ][cluster.routing.allocation] [node-1] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[session_login.cgi][3]] ...]).
[2017-04-04 20:07:20,470][INFO ][cluster.service          ] [node-1] added {{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false},}, reason: zen-disco-join(join from node[{TEST-c397e76f-56b3-4ec8-9f99-7e26f77f9f2c}{u-bKA93wQs2DvOI_K30yeQ}{192.168.37.55}{192.168.37.55:9350}{client=true, data=false, master=false}])

**SECONDARY NODE (graylog-server/server.log)**
2017-04-04T20:22:41.267-05:00 WARN  [ProxiedResource] Unable to call http://192.168.37.55:9000/api/system on node <c397e76f-56b3-4ec8-9f99-7e26f77f9f2c>
java.net.SocketTimeoutException: timeout
        at okio.Okio$4.newTimeoutException(Okio.java:227) ~[graylog.jar:?]
        at okio.AsyncTimeout.exit(AsyncTimeout.java:284) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:240) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:325) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:314) ~[graylog.jar:?]
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:210) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readResponse(Http1Codec.java:191) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:132) ~[graylog.jar:?]
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:54) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:59) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:179) ~[graylog.jar:?]
        at okhttp3.RealCall.execute(RealCall.java:63) ~[graylog.jar:?]
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:174) ~[graylog.jar:?]
        at org.graylog2.shared.rest.resources.ProxiedResource.lambda$null$0(ProxiedResource.java:76) ~[graylog.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_121]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_121]
        at java.lang.Thread.run(Thread.java:745) [?:1.8.0_121]
Caused by: java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_121]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_121]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_121]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_121]
        at okio.Okio$2.read(Okio.java:138) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:236) ~[graylog.jar:?]
        ... 29 more

Not sure what logs would be helpful at this point.
Any help is appreciated!

Please post the complete logs of your Graylog node.

Last 2000 lines of each log:

Node 1:
https://pastebin.com/JqT2TqQ4 (N1 mongodb.log)
https://pastebin.com/jF7t6ZuH (N1 elasticsearch.log)
https://pastebin.com/HPMnXRC1 (N1 graylog-server/server.log)

Node 2:
https://pastebin.com/e1wSfNLM (N2 elasticsearch.log)
https://pastebin.com/i2c8dj84 (N2 mongodb.log)
https://pastebin.com/EbdZkXbZ (Correct N2 server.log link)

If there is a better way to post the logs, please let me know. I didn’t want to post the WHOLE thing as they’re quite lengthy.
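
For reference, the excerpts were grabbed with tail against the default log locations (a sketch; the paths below are the stock package defaults and may differ on your install):

# last 2000 lines of each log, e.g. on node 1
# (Elasticsearch names its log file after the cluster name)
tail -n 2000 /var/log/graylog-server/server.log > n1-server.log
tail -n 2000 /var/log/elasticsearch/TEST-graylog.log > n1-elasticsearch.log
tail -n 2000 /var/log/mongodb/mongod.log > n1-mongodb.log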

There’s only one log from a Graylog node there, is that correct?

Additionally, please share your Graylog configuration files.

That’s my fault, I pasted the wrong link. There are two Graylog node logs.
The sixth link has been corrected.

server.conf (some info has been obscured)

is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = xxxxxxxxxxxxxx
root_username = XXXX
root_password_sha2 = xxxxxxxxx
plugin_dir = /usr/share/graylog-server/plugin
rest_listen_uri = http://192.168.37.52:9000/api/
#rest_transport_uri = http://192.168.1.1:9000/api/
web_listen_uri = http://192.168.37.52:9000/
#elasticsearch_config_file = /etc/graylog/server/elasticsearch.yml
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
#elasticsearch_max_size_per_index = 1073741824
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_cluster_name = XXXX-graylog
elasticsearch_node_name_prefix = XXXX-
elasticsearch_discovery_zen_ping_unicast_hosts = GraySrv1:9300, GraySrv2:9300
#elasticsearch_node_master = true
#elasticsearch_node_data = true
elasticsearch_cluster_discovery_timeout = 50000
elasticsearch_network_host = 192.168.37.52
elasticsearch_network_bind_host = 192.168.37.52
elasticsearch_network_publish_host = 192.168.37.52
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true

message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3

stale_master_timeout = 20000

mongodb_uri = mongodb://GraySrv1:27017,GraySrv2:27017,GrayArb:27017/graylog?replicaSet=rs0
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5

transport_email_enabled = true
transport_email_hostname = smtp.XXXXXXX.com
transport_email_port = 25
transport_email_use_auth = false
transport_email_use_tls = false
transport_email_use_ssl = false
transport_email_auth_username = graylog@XXXXXXX.com
#transport_email_auth_password = secret
#transport_email_subject_prefix = [graylog]
transport_email_from_email = graylog@XXXXXXXXX.com
transport_email_web_interface_url = http://graylog.XXXXXXXX.com
content_packs_dir = /usr/share/graylog-server/contentpacks
content_packs_auto_load = grok-patterns.json
proxied_requests_thread_pool_size = 32

server.conf is the same on both nodes, apart from the addressing differences.
EDIT: The other notable difference is that one node is set to is_master = true and the other to is_master = false.
I had no issues with this configuration until the incident. I’ve also been running a virtual cluster in exactly the same setup (a test to make sure everything worked correctly before reformatting the hardware stack) and have had no issues with it.
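
If it helps, the comparison between the two configs was essentially this (hypothetical one-liner, assuming SSH access from node 1 to node 2):

# only the listen/network addresses and is_master should differ
diff /etc/graylog/server/server.conf <(ssh GraySrv2 cat /etc/graylog/server/server.conf)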

I’m wondering if it may be an issue between Java and Elasticsearch.
I’ve been trying to find out whether it’s caused by too little memory being allocated to the Java heap, and if so, how to adjust that properly.
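
In case I'm looking in the wrong place: as far as I can tell (assuming the standard DEB/RPM packages; file locations differ per distro), the heap sizes are set via environment files rather than server.conf:

# Graylog heap: GRAYLOG_SERVER_JAVA_OPTS in /etc/default/graylog-server
# (or /etc/sysconfig/graylog-server); keep the GC flags that ship with the package
GRAYLOG_SERVER_JAVA_OPTS="-Xms2g -Xmx2g <existing GC flags>"

# Elasticsearch 2.x heap: ES_HEAP_SIZE in /etc/default/elasticsearch
# (or /etc/sysconfig/elasticsearch); restart the services afterwards
ES_HEAP_SIZE=4g

The 2g/4g values above are just illustrative.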

EDIT2: I increased the heap size to 10g for Graylog, which didn’t resolve the issue. I did find out that the process is maxing out its memory usage.

Also, some of the ES indices (according to what little I can see in the web UI) are showing as red. I checked their status using curl, and they all come back green (on both nodes). I’m still not sure what the issue is, but it has something to do with Graylog eating all of its available memory.
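
For reference, the checks were along these lines (the ES endpoints are the standard 2.x health APIs; the Graylog call is the same /api/system URL that times out in the server.log above):

# Elasticsearch cluster and per-index health, run against either node
curl -s 'http://192.168.37.52:9200/_cluster/health?pretty'
curl -s 'http://192.168.37.52:9200/_cat/indices?v'

# Graylog node API (needs the web interface credentials via -u)
curl -s -u youradmin http://192.168.37.52:9000/api/system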