1. Describe your incident:
Starting yesterday (3/20), two of my 40ish sidecar collectors lost the ability to connect to the Graylog server (one openSUSE 15.3, one RHEL8). Both of these nodes are acting the same, but all of my troubleshooting has been focused on the openSUSE 15.3 server thus far.
The Graylog Server shows the sidecar node as inactive, and the collector (journalbeat) has an “Unknown” status with the message “Received no ping signal from sidecar(18 hours ago)”.
Focusing on the openSUSE 15.3 node, the node in question has not had any hardware changes, nor any significant software updates that our other nodes have not received. The graylog-sidecar service starts without errors, the server can ping the graylog server and telnet to port 9000, and can even curl into the API. The sidecar.yml file has two entries (server_url and server_api_token), both of which are correct and match other nodes. Despite this, the Graylog server stubbornly insists that the sidecar is inactive.
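For reference, the relevant part of sidecar.yml looks something like this (values below are placeholders, not my real token):
server_url: "https://graylog.domain.com:9000/api/"
server_api_token: "<REDACTED_API_TOKEN>"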
I have verified that the node-id listed in the Graylog sidecar page matches the node-id in the node-id file on the sidecar machine.
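For anyone wanting to repeat that check, the comparison was basically (path assumes the default Linux install location):
# cat /etc/graylog/sidecar/node-id
9c5e38be-67b7-4ae4-a6cd-fbe00db9d9e9
and then compare that value against the node-id shown on the System > Sidecars page.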
2. Describe your environment:
OS Information: openSUSE 15.3 Linux (sidecar and server)
Service logs, configurations, and environment variables:
sidecar.log:
time="2022-03-21T09:33:24-04:00" level=info msg="Stopping signal distributor"
time="2022-03-21T09:33:39-04:00" level=info msg="Starting signal distributor"
time="2022-03-21T09:49:11-04:00" level=info msg="Stopping signal distributor"
time="2022-03-21T10:18:46-04:00" level=info msg="Starting signal distributor"
There is this error from yesterday when the Graylog server was being rebooted for kernel patches, and it appears on the working nodes as well:
time="2022-03-20T08:41:31-04:00" level=error msg="[UpdateRegistration] Failed to report collector status to server: Put \"https://graylog.domain.com:9000/api/sidecars/9c5e38be-67b7-4ae4-a6cd-fbe00db9d9e9\": dial tcp 192.168.1.79:9000: connect: connection refused"
3. What steps have you already taken to try and solve the problem?
- Restarted graylog-sidecar
- Reinstalled graylog-sidecar
- traceroute graylog.domain.com (success, IP resolved correctly)
- telnet graylog.domain.com 9000 (success)
- curl -i -u username:password -H 'Accept: application/json' 'https://graylog.domain.com:9000/api/plugins/org.graylog.plugins.collector/configuration' (an equivalent check using the sidecar's API token is sketched after this list)
output:
HTTP/1.1 204 No Content
Cache-Control: private, no-transform
X-Graylog-Node-ID: 6700cf0f-9938-4fca-b69a-981bb65ea8e6
- Attempted to apply a filebeat configuration to the node, but it too will not deploy (presumably because Graylog is convinced that this node is inactive and unavailable)
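For completeness, a similar connectivity check can be run with the sidecar's own API token (Graylog API tokens go in as the username with the literal password "token"; the token value below is a placeholder):
curl -i -u <SERVER_API_TOKEN>:token -H 'Accept: application/json' 'https://graylog.domain.com:9000/api/sidecars'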
4. How can the community help?
Trying to reconnect this inactive node back to Graylog so that the collector will start sending data again.
You can stop the sidecar service and run it manually to see what comes up - here is a normal manual startup from my Ubuntu system:
$ sudo /usr/bin/graylog-sidecar
INFO[0000] Using node-id: r9999992-ab12-4444-2323-yabbadabbadoo
INFO[0000] No node name was configured, falling back to hostname
INFO[0000] Starting signal distributor
INFO[0010] Adding process runner for: filebeat
INFO[0010] [filebeat] Configuration change detected, rewriting configuration file.
INFO[0010] [filebeat] Starting (exec driver)
What version of filebeat/nxlog do you have installed? You can also post your (obfuscated, please) sidecar configuration.
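On a default Linux package install, something like this should pull the config and the installed collector versions (paths and package names are assumptions, adjust for your layout):
# cat /etc/graylog/sidecar/sidecar.yml
# rpm -q graylog-sidecar filebeat journalbeat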
I’m not seeing anything from manually running it that didn’t show up in the log:
# /usr/bin/graylog-sidecar
INFO[0000] Using node-id: 9c5e38be-67b7-4ae4-a6cd-fbe00db9d9e9
INFO[0000] No node name was configured, falling back to hostname
INFO[0000] Starting signal distributor
and it’s been sitting there for a couple minutes now, similar to what I see in the logs.
Filebeat: filebeat-7.17.1-1
Journalbeat: journalbeat-7.15.2-1
Sidecar configuration is pretty basic, with just the two entries mentioned above (server_url and server_api_token).
That’s the fun part, the configurations match between what is working and what isn’t.
And even if a specific collector wasn’t working, I would expect the node itself to be able to check in and be active within the Graylog interface.
To confirm this, I just installed the sidecar on a new server using the exact same configuration (different node-id) and it is available in the Collectors Administration page waiting for me to deploy a collector configuration to it.
Also, it’s been an hour and my manually started sidecar has not updated its output - it is still on ‘Starting signal distributor’
It looks like I need to dig deeper into what the service does on startup since that’s where the process is breaking down.
You can ignore the collector configuration that is applied - that means nothing until your client can connect to Graylog. Have you looked at permissions on all the sidecar-related files?
What about permissions on the journalbeat/filebeat directories? Those are typically installed separately on Linux, I think? It looks like the sidecar itself may be starting up, but it isn’t finding the beats binaries you intend to use…
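A quick spot-check would be something along these lines (paths assume the default package locations, adjust as needed):
# ls -l /etc/graylog/sidecar/ /var/log/graylog-sidecar/
# ls -l /usr/bin/filebeat /usr/bin/journalbeat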
Yes, the filebeat and journalbeat packages were installed at the same time as the Graylog sidecar.
Running versions:
Filebeat: filebeat-7.17.1-1
Journalbeat: journalbeat-7.15.2-1
which matches the working systems.
My assumption has been that there would be errors if it was a problem with the binaries and that the sidecar itself would show active on the Graylog server, regardless of beats binary status.
I did the permission check on journalbeat and everything looked ok from that perspective with no missing files/directories.
Taking a step back, what is Graylog doing to determine if a sidecar is active or inactive? Once inactive, how does Graylog determine that it is ever active again?
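From what I can piece together (and based on the [UpdateRegistration] call in the log above), the sidecar periodically PUTs its registration to /api/sidecars/<node-id>, and the server marks it inactive once no update has arrived within the inactive threshold (sidecar_inactive_threshold in server.conf, if I remember right). The current registration state can be inspected with something like:
curl -u username:password -H 'Accept: application/json' 'https://graylog.domain.com:9000/api/sidecars'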
The firewall on the graylog server is accepting on port 9000 (and other sidecars are checking in as expected). There are no outbound firewall rules (it’s permit all) on either node.
AppArmor is not logging that it is blocking anything. I disabled it temporarily and restarted the service with no change in results. SELinux is in permissive mode on the RHEL8 server.
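For the record, the checks were roughly the following (shown as a sketch):
On the openSUSE node:
# aa-status
# systemctl stop apparmor && systemctl restart graylog-sidecar
On the RHEL8 node:
# getenforce
Permissive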
Thank you for suggesting tcpdump! I had not thought of that. On both nodes, there is no traffic detected to the graylog server on a service restart (or manual launch of the sidecar process), but I can see the traffic of the telnet request, so I know traffic can route and gets logged. In fact, I see no traffic going out to port 9000 at all when the service is restarted.
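The capture was along these lines (interface name is a placeholder):
# tcpdump -ni eth0 host graylog.domain.com and port 9000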
This definitely moves the issue firmly onto the sidecar itself, as the server is rightly reporting that it is not receiving data from the node.
Here’s where I get stuck though.
The configuration is correctly pointed to “https://graylog.domain.com:9000/api”
Running wget against the server correctly returns the Graylog web page.
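For anyone reproducing the check, something like this confirms the web interface answers (flags are illustrative):
$ wget -qO- https://graylog.domain.com:9000/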
Well, at least we know it’s the sidecar on the remote server.
To be honest, I’m not sure right now (I’ll help look into it). If the sidecar is started but cannot connect to Graylog, then there are only a couple of things I can think of right off hand:
1. The port is being blocked, but I think you confirmed that it is not.
2. The node that the sidecar resides on doesn’t know the FQDN of the Graylog server. Have you tried just the IP address, or adjusting the /etc/hosts file? Example: 192.168.1.100 graylog.domain.com
3. I think you mentioned permissions were good on the sidecar.
4. Some evil Graylog magic that I don’t know about.
I would look through everything in /var/log for anything that may pertain to the sidecar, or to a service that may be interrupting this connection. I forget if there is a way to set the sidecar to debug/verbose logging; if so, that may help out a little more.
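If memory serves, sidecar.yml accepts a log_level setting (worth double-checking against the docs for your sidecar version), e.g.:
log_level: "debug"
then restart graylog-sidecar and watch sidecar.log.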
Here is an example of mine; it’s for Windows, but even though I have an FQDN I used my IP address, and I made sure the Graylog sidecar can access the binaries. Just a thought.
I’m not sure about AppArmor, but I do know SELinux: when changing states in the config file (e.g., from enforcing to permissive), a reboot is required for it to take effect. Was that what you did?
It may not be the issue, but just to make sure there’s nothing in the way.
Solved!
The missing piece for me was that the sidecar process will not check into the graylog server until it has brought up the underlying beats processes (journalbeat and filebeat in my case).
When journalbeat starts, it runs through the mounted file systems, among other things.
Both of these servers hold an NFS mount on our backend file server, but when the file server was patched and rebooted on Sunday morning, these two mounts failed to reattach themselves and got stuck (I’m still trying to figure out why they got stuck, but the terminal locked whenever I tried to issue commands that touched the stuck mount).
Performing a lazy unmount and remounting the shares freed up the disk operations, which then allowed journalbeat (and then filebeat) to start up like normal.
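For the record, the recovery on each node was roughly (mount point is a placeholder):
# umount -l /mnt/backend
# mount /mnt/backend
# systemctl restart graylog-sidecar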
TL;DR: Disk operations were timing out which prevented journalbeat from starting up correctly. The sidecar process was waiting on journalbeat to come up before checking in with the Graylog server.
Thank you to both @tmacgbay and @gsmith for all the help in narrowing this down.
Well, they are NFS mounts, so the solution was to remind fstab of that and add _netdev to their mount options. That should make things go a little more smoothly if this ever happens again.
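The fstab entries now look something like this (server name and paths are placeholders):
fileserver.domain.com:/export/backend  /mnt/backend  nfs  defaults,_netdev  0  0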