Upgrade of Graylog appliance (omnibus) failed: etcd issue

Hi,

I was trying to upgrade my current Graylog (omnibus appliance) from 2.2.3-1 to the latest 2.3.0-1 version.
I followed this guide http://docs.graylog.org/en/2.3/pages/configuration/graylog_ctl.html#graylog-ctl (Upgrade Graylog),
but it seems my Graylog appliance is no longer starting due to a problem with etcd.

graylog-ctl tail etcd says:

2017-08-01_15:29:26.34567 2017-08-01 17:29:26.345265 I | etcdmain: etcd Version: 3.2.4
2017-08-01_15:29:26.34836 2017-08-01 17:29:26.347583 I | etcdmain: Git SHA: c31bec0
2017-08-01_15:29:26.34928 2017-08-01 17:29:26.348290 I | etcdmain: Go Version: go1.8.3
2017-08-01_15:29:26.35166 2017-08-01 17:29:26.348358 I | etcdmain: Go OS/Arch: linux/amd64
2017-08-01_15:29:26.35273 2017-08-01 17:29:26.348415 I | etcdmain: setting maximum number of CPUs to 4, total number of available CPUs is 4
2017-08-01_15:29:26.35470 2017-08-01 17:29:26.348886 N | etcdmain: the server is already initialized as member before, starting as etcd member...
2017-08-01_15:29:26.35685 2017-08-01 17:29:26.351577 I | embed: listening for peers on http://localhost:2380
2017-08-01_15:29:26.35828 2017-08-01 17:29:26.352294 I | embed: listening for client requests on 0.0.0.0:2379
2017-08-01_15:29:26.36032 2017-08-01 17:29:26.352429 I | embed: listening for client requests on 0.0.0.0:4001
2017-08-01_15:29:26.36530 2017-08-01 17:29:26.365076 W | snap: skipped unexpected non snapshot file 0000000000000018-0000000000be24be.snap.broken
2017-08-01_15:29:26.36621 2017-08-01 17:29:26.365158 W | snap: skipped unexpected non snapshot file 000000000000001a-0000000000be4bd1.snap.broken
2017-08-01_15:29:26.36752 2017-08-01 17:29:26.365169 W | snap: skipped unexpected non snapshot file 0000000000000019-0000000000be4bcf.snap.broken
2017-08-01_15:29:26.36968 2017-08-01 17:29:26.365237 W | snap: skipped unexpected non snapshot file 0000000000000009-000000000004e220.snap.broken
2017-08-01_15:29:26.37143 2017-08-01 17:29:26.365290 W | snap: skipped unexpected non snapshot file 000000000000000a-0000000000050931.snap.broken
2017-08-01_15:29:26.37863 2017-08-01 17:29:26.378369 I | etcdserver: recovered store from snapshot at index 20472049
2017-08-01_15:29:26.39812 2017-08-01 17:29:26.397875 C | etcdserver: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
2017-08-01_15:29:26.40179 panic: recovering backend from snapshot error: database snapshot file path error: snap: snapshot file doesn't exist
2017-08-01_15:29:26.40502       panic: runtime error: invalid memory address or nil pointer dereference
2017-08-01_15:29:26.40630 [signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0xb5129c]
2017-08-01_15:29:26.40814
2017-08-01_15:29:26.41164 goroutine 1 [running]:
2017-08-01_15:29:26.41488 github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer.func1(0xc4201d8678, 0xc4201d8470)
2017-08-01_15:29:26.41704       /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:279 +0x3c
2017-08-01_15:29:26.41865 panic(0xd628e0, 0xc4202c3c20)
2017-08-01_15:29:26.42106       /usr/local/go/src/runtime/panic.go:489 +0x2cf
2017-08-01_15:29:26.42226 github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc420189fc0, 0xf31eff, 0x2a, 0xc4201d84e0, 0x1, 0x1)
2017-08-01_15:29:26.42836       /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x15c
2017-08-01_15:29:26.42979 github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver.NewServer(0xc42028e000, 0x0, 0x1402540, 0xc4202c3bb0)
2017-08-01_15:29:26.43199       /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdserver/server.go:374 +0x2e39
2017-08-01_15:29:26.43397 github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed.StartEtcd(0xc420272000, 0x0, 0x0, 0x0)
2017-08-01_15:29:26.43785       /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/embed/etcd.go:147 +0x7c0
2017-08-01_15:29:26.44008 github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcd(0xc420272000, 0x6, 0xf0e97d, 0x6, 0x1)
2017-08-01_15:29:26.44228       /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:186 +0x58
2017-08-01_15:29:26.44358 github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.startEtcdOrProxyV2()
2017-08-01_15:29:26.45187       /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain/etcd.go:103 +0x15ba
2017-08-01_15:29:26.45375 github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/etcdmain.Main()
2017-08-01_15:29:26.45581       /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20
2017-08-01_15:29:26.45837 main.main()
2017-08-01_15:29:26.46163       /home/gyuho/go/src/github.com/coreos/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/etcd/main.go:28 +0x20
^C/opt/graylog/embedded/lib/ruby/gems/2.1.0/gems/omnibus-ctl-0.0.7/lib/omnibus-ctl.rb:290:in `system': Interrupt
        from /opt/graylog/embedded/lib/ruby/gems/2.1.0/gems/omnibus-ctl-0.0.7/lib/omnibus-ctl.rb:290:in `tail'
        from /opt/graylog/embedded/lib/ruby/gems/2.1.0/gems/omnibus-ctl-0.0.7/lib/omnibus-ctl.rb:366:in `run'
        from /opt/graylog/embedded/lib/ruby/gems/2.1.0/gems/omnibus-ctl-0.0.7/bin/omnibus-ctl:27:in `<top (required)>'
        from /opt/graylog/embedded/bin/omnibus-ctl:23:in `load'
        from /opt/graylog/embedded/bin/omnibus-ctl:23:in `<main>'

Thankfully I made a VMware snapshot beforehand, so I can roll back.
Do you have any idea why it fails for me, though?

cheers,
micsnare

Does the upgrade also fail if you try it again?

It looks like it’s either a bug in the version of etcd being used or faulty hardware.

Hi Jochen,

I rolled back and tried the upgrade again, with the same result.
It says:

Running handlers:
[2017-08-02T14:40:04+02:00] ERROR: Running exception handlers
Running handlers complete
[2017-08-02T14:40:04+02:00] ERROR: Exception handlers complete
Chef Client failed. 13 resources updated in 01 minutes 34 seconds
[2017-08-02T14:40:04+02:00] FATAL: Stacktrace dumped to /opt/graylog/embedded/cookbooks/cache/chef-stacktrace.out
[2017-08-02T14:40:04+02:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2017-08-02T14:40:04+02:00] ERROR: ruby_block[add node to server list] (graylog::graylog-server line 84) had an error: Errno::ECONNREFUSED: Connection refused - connect(2) for "127.0.0.1" port 4001
[2017-08-02T14:40:04+02:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)

Should etcd be listening on port 4001?
I’ve uploaded the stacktrace here:
https://pastebin.com/qFgmRTYS

but it doesn’t say much… hmm.
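
In case it helps, this is how I’ve been checking whether anything is actually listening on the etcd client ports on the appliance (commands assume the default single-box layout; the /version endpoint is part of etcd’s HTTP client API):

sudo graylog-ctl status etcd
sudo ss -tlnp | grep -E ':2379|:4001'
curl http://127.0.0.1:4001/version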

Hi,
technically the upgrade of etcd should work out of the box. The errors in your logs could indicate that the etcd database was already corrupted long before the upgrade; maybe the disk ran out of space or something similar.
If this is a single-server setup, you can delete the etcd data directory and start it from scratch:

sudo graylog-ctl stop etcd
cd /var/opt/graylog/data/etcd/
sudo rm -r *
sudo graylog-ctl start etcd
sudo graylog-ctl reconfigure

If you are running a cluster, you have to repeat the set-cluster-master and reconfigure-as- steps (roughly as sketched below).
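
Going from memory of the multi-VM guide, it would be something along these lines on each non-master node, with the master IP as a placeholder and the role suffix matching whatever the node was originally set up as:

sudo graylog-ctl set-cluster-master <ip-of-master>
sudo graylog-ctl reconfigure-as-backend   # or reconfigure-as-datanode / reconfigure-as-webinterface, depending on the node's role
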
Hope that helps!

Marius.

Hi Marius,

thanks for the advice, this actually worked pretty well for me, with one exception: after these steps (and even a reboot) the Graylog server can no longer find its Elasticsearch node.

here are the logs:
https://pastebin.com/Xw0nncc6
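
For reference, this is what I run on the box to see whether Elasticsearch itself is up (assuming the default single-box setup with Elasticsearch listening on localhost:9200):

sudo graylog-ctl status elasticsearch
curl -s 'http://127.0.0.1:9200/_cluster/health?pretty'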

Any idea what went wrong during the upgrade?

cheers,
micsnare

I tried the upgrade this morning and have attached the section of output showing the error. Everything before that seems to be OK. Is this just a simple config change for port 4001 failing? Should I open another topic, since this appears to be Ruby-related?

jeff

  * execute[/opt/graylog/embedded/bin/graylog-ctl start graylog-server] action run
    - execute /opt/graylog/embedded/bin/graylog-ctl start graylog-server
  * ruby_block[add node to server list] action run
    
    ================================================================================
    Error executing action `run` on resource 'ruby_block[add node to server list]'
    ================================================================================
    
    Errno::ECONNREFUSED
    -------------------
    Connection refused - connect(2) for "127.0.0.1" port 4001
    
    Cookbook Trace:
    ---------------
    /opt/graylog/embedded/cookbooks/graylog/libraries/registry.rb:17:in `set_master'
    /opt/graylog/embedded/cookbooks/graylog/recipes/graylog-server.rb:86:in `block (2 levels) in from_file'
    
    Resource Declaration:
    ---------------------
    # In /opt/graylog/embedded/cookbooks/graylog/recipes/graylog-server.rb
    
     84: ruby_block "add node to server list" do
     85:   block do
     86:     $registry.set_master
     87:     $registry.add_gl_server(node['ipaddress'])
     88:     $registry.add_es_node(node['ipaddress'])
     89:   end
     90:   retries 15
     91: end
    
    Compiled Resource:
    ------------------
    # Declared in /opt/graylog/embedded/cookbooks/graylog/recipes/graylog-server.rb:84:in `from_file'
    
    ruby_block("add node to server list") do
      action [:run]
      retries 15
      retry_delay 2
      default_guard_interpreter :default
      block_name "add node to server list"
      declared_type :ruby_block
      cookbook_name :graylog
      recipe_name "graylog-server"
      block #<Proc:0x00000003943718@/opt/graylog/embedded/cookbooks/graylog/recipes/graylog-server.rb:85>
    end
    
Recipe: graylog::elasticsearch
  * service[elasticsearch] action restart
    - restart service service[elasticsearch]
Recipe: graylog::graylog-server
  * service[graylog-server] action restart
    - restart service service[graylog-server]

Running handlers:
[2017-08-04T16:02:04+00:00] ERROR: Running exception handlers
Running handlers complete
[2017-08-04T16:02:04+00:00] ERROR: Exception handlers complete
Chef Client failed. 15 resources updated in 02 minutes 31 seconds
[2017-08-04T16:02:04+00:00] FATAL: Stacktrace dumped to /opt/graylog/embedded/cookbooks/cache/chef-stacktrace.out
[2017-08-04T16:02:04+00:00] FATAL: Please provide the contents of the stacktrace.out file if you file a bug report
[2017-08-04T16:02:04+00:00] ERROR: ruby_block[add node to server list] (graylog::graylog-server line 84) had an error: Errno::ECONNREFUSED: Connection refused - connect(2) for "127.0.0.1" port 4001
[2017-08-04T16:02:04+00:00] FATAL: Chef::Exceptions::ChildConvergeError: Chef run process exited unsuccessfully (exit code 1)
ubuntu@graylog:~$

I did this update as well and it was a bit of a train wreck.
Deleting the etcd files fixed my etcd-service-down issue.
It also stopped the few thousand log entries a minute that were being logged.
Now I need to get NGINX set back up, as all of its configuration appears to have disappeared.
At least I can log in via port 9000, but no SSL yet.
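
If I’m reading the appliance docs right, getting HTTPS back should just be a matter of re-running the SSL step and a reconfigure, though I haven’t re-tested this yet:

sudo graylog-ctl enforce-ssl
sudo graylog-ctl reconfigure
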
This is more of a bump than anything else, so the developers hopefully see this post.

Thanks,

Keith
