1. Describe your incident:
Starting yesterday (3/20), two of my 40ish sidecar collectors lost the ability to connect to the Graylog server (one openSUSE 15.3, one RHEL8). Both of these nodes are acting the same, but all of my troubleshooting has been focused on the openSUSE 15.3 server thus far.
The Graylog server shows the sidecar node as inactive, and the collector (journalbeat) has an "Unknown" status with the message "Received no ping signal from sidecar (18 hours ago)".
Focusing on the openSUSE 15.3 node, the node in question has not had any hardware changes, nor any significant software updates that our other nodes have not received. The graylog-sidecar service starts without errors, the server can ping the graylog server and telnet to port 9000, and can even curl into the API. The sidecar.yml file has two entries (server_url and server_api_token), both of which are correct and match other nodes. Despite this, the Graylog server stubbornly insists that the sidecar is inactive.
I have verified that the node-id listed on the Graylog sidecar page matches the node-id in the node-id file on the sidecar machine.
2. Describe your environment:
OS Information: openSUSE 15.3 Linux (sidecar and server)
Service logs, configurations, and environment variables:
time="2022-03-21T09:33:24-04:00" level=info msg="Stopping signal distributor"
time="2022-03-21T09:33:39-04:00" level=info msg="Starting signal distributor"
time="2022-03-21T09:49:11-04:00" level=info msg="Stopping signal distributor"
time="2022-03-21T10:18:46-04:00" level=info msg="Starting signal distributor"
There is this error from yesterday when the Graylog server was being rebooted for kernel patches, and it appears on the working nodes as well:
time="2022-03-20T08:41:31-04:00" level=error msg="[UpdateRegistration] Failed to report collector status to server: Put \"https://graylog.domain.com:9000/api/sidecars/9c5e38be-67b7-4ae4-a6cd-fbe00db9d9e9\": dial tcp 192.168.1.79:9000: connect: connection refused"
3. What steps have you already taken to try and solve the problem?
- traceroute graylog.domain.com (success, IP resolved correctly)
- telnet graylog.domain.com 9000 (success)
- curl -i -u username:password -H 'Accept: application/json' 'https://graylog.domain.com:9000/api/plugins/org.graylog.plugins.collector/configuration'
HTTP/1.1 204 No Content
Cache-Control: private, no-transform
- attempted to apply a filebeat configuration to the node, but it too will not deploy (presumably because Graylog is convinced that this node is inactive and unavailable)
4. How can the community help?
Trying to reconnect this inactive node back to Graylog so that the collector will start sending data again.
You can stop the sidecar service and run it manually to see what comes up - here is a normal manual startup from my Ubuntu system:
$ sudo /usr/bin/graylog-sidecar
INFO Using node-id: r9999992-ab12-4444-2323-yabbadabbadoo
INFO No node name was configured, falling back to hostname
INFO Starting signal distributor
INFO Adding process runner for: filebeat
INFO [filebeat] Configuration change detected, rewriting configuration file.
INFO [filebeat] Starting (exec driver)
What version of filebeat/nxlog do you have installed? You can grab your (obfuscated, please) sidecar configuration with:
cat /etc/graylog/sidecar/sidecar.yml | egrep -v "^\s*(#|$)"
Thanks for replying!
I’m not seeing anything from manually running it that didn’t show up in the log:
INFO Using node-id: 9c5e38be-67b7-4ae4-a6cd-fbe00db9d9e9
INFO No node name was configured, falling back to hostname
INFO Starting signal distributor
and it’s been sitting there for a couple minutes now, similar to what I see in the logs.
Sidecar configuration is pretty basic, with just two entries:
# cat /etc/graylog/sidecar/sidecar.yml | egrep -v "^\s*(#|$)"
server_url: "https://graylog.domain.com:9000/api"
server_api_token: "<<token id>>"
This is the same configuration that exists on all of my working nodes as well.
Hmmm - Not sure what is going on - here is my working config… for posterity…
server_api_token: "<<token id>>"
Unless there is something oddly missing or was updated? Maybe filebeat needs to be on the access list now, or a local certificate expired?
That’s the fun part, the configurations match between what is working and what isn’t.
And even if a specific collector wasn’t working, I would expect the node itself to be able to check in and be active within the Graylog interface.
To confirm this, I just installed the sidecar on a new server using the exact same configuration (different node-id) and it is available in the Collectors Administration page waiting for me to deploy a collector configuration to it.
Also, it’s been an hour and my manually started sidecar has not updated its output - it is still on ‘Starting signal distributor’
It looks like I need to dig deeper into what the service does on startup since that’s where the process is breaking down.
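One hypothetical way to dig deeper (assuming strace is available on the node; the binary path is the one from this thread) is to trace the sidecar's network syscalls during a manual foreground start:

```shell
# Hypothetical debugging step: trace network syscalls during a manual start.
# If the sidecar never issues a connect() toward 192.168.1.79:9000, the
# process is stalling before it ever reaches the network layer.
SIDECAR_BIN="/usr/bin/graylog-sidecar"
TRACE_CMD="strace -f -tt -e trace=network $SIDECAR_BIN"
echo "run as root: $TRACE_CMD"
```

Stop the graylog-sidecar service first so the traced instance is the only one running.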
You can ignore the collector configuration that is applied - that means nothing until your client can connect to Graylog. Have you looked at permissions on all the sidecar-related files?
I have not… we let the service run as root, so I didn't think about permissions.
# rpm -ql graylog-sidecar
# ls -lash /etc/graylog/sidecar/sidecar.yml
4.0K -rw-r--r-- 1 root root 255 Sep 10 2021 /etc/graylog/sidecar/sidecar.yml
# ls -lash /usr/bin/graylog-sidecar
8.0M -rwxr-xr-x 1 root root 8.0M Mar 11 2021 /usr/bin/graylog-sidecar
# ls -lash /var/lib/graylog-sidecar/generated
0 drwxr-xr-x 1 root root 32 Sep 10 2021 .
0 drwxr-xr-x 1 root root 18 Sep 10 2021 ..
4.0K -rw------- 1 root root 432 Mar 20 05:05 journalbeat.conf
# ls -lash /var/log/graylog-sidecar
0 drwxr-xr-x 1 root root 226 Mar 20 05:05 .
0 drwxr-xr-x 1 root root 14K Mar 20 05:05 ..
1.3M -rw------- 1 root root 1.3M Mar 20 17:00 journalbeat
4.0K -rw------- 1 root root 3.8K Mar 20 05:05 journalbeat.1
8.4M -rw------- 1 root root 8.4M Mar 20 05:05 journalbeat.2
11M -rw------- 1 root root 11M Mar 16 19:25 journalbeat.3
11M -rw------- 1 root root 11M Mar 12 16:15 journalbeat.4
11M -rw------- 1 root root 11M Mar 8 14:06 journalbeat.5
11M -rw------- 1 root root 11M Mar 4 11:53 journalbeat.6
11M -rw------- 1 root root 11M Feb 28 10:10 journalbeat.7
40K -rw-r--r-- 1 root root 39K Mar 21 12:49 sidecar.log
# ls -lash /var/run/graylog-sidecar
ls: cannot access '/var/run/graylog-sidecar': No such file or directory
I have checked on a couple different working systems, and they do not have a /var/run/graylog-sidecar either.
Otherwise, the permissions look fine to me.
What about permissions on the journalbeat/filebeat directories? Those are typically installed on Linux separately, I think? It looks like the sidecar itself may be starting up, but it isn't finding the beats binaries you intend to use…
Yes, the filebeat and journalbeat packages were installed at the same time as the Graylog sidecar, which matches the working systems.
My assumption has been that there would be errors if it was a problem with the binaries and that the sidecar itself would show active on the Graylog server, regardless of beats binary status.
I did the permission check on journalbeat and everything looked ok from that perspective with no missing files/directories.
Taking a step back, what is Graylog doing to determine if a sidecar is active or inactive? Once inactive, how does Graylog determine that it is ever active again?
I think that is the node-id:
$ cat /etc/graylog/sidecar/node-id
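For anyone following along, the comparison can be sketched like this (the UI value is the node-id from this thread; substitute whatever your sidecar overview page shows):

```shell
# Sketch: compare the node-id Graylog displays with the local file.
# ui_node_id is taken from this thread; replace it with your own.
ui_node_id="9c5e38be-67b7-4ae4-a6cd-fbe00db9d9e9"
node_id_file="${NODE_ID_FILE:-/etc/graylog/sidecar/node-id}"
if [ -r "$node_id_file" ]; then
  # strip any trailing newline/whitespace before comparing
  local_node_id="$(tr -d '[:space:]' < "$node_id_file")"
else
  local_node_id="(unreadable: $node_id_file)"
fi
if [ "$ui_node_id" = "$local_node_id" ]; then
  echo "node-id matches"
else
  echo "node-id mismatch: UI=$ui_node_id local=$local_node_id"
fi
```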
I am interested in the solution, but I have run out of ideas to try for the moment…
Just chiming in, I looked over this post. Just an idea: have you checked the firewall? Is SELinux/AppArmor enabled on the node in question?
EDIT: By chance did you do a tcpdump on Graylog? If so, did you see the remote server?
Hi gsmith, thanks for hopping in.
The firewall on the graylog server is accepting on port 9000 (and other sidecars are checking in as expected). There are no outbound firewall rules (it’s permit all) on either node.
AppArmor is not logging that it is blocking anything. I disabled it temporarily and restarted the service with no change in results. SELinux is in permissive mode on the RHEL8 server.
Thank you for suggesting tcpdump! I had not thought of that. On both nodes, there is no traffic detected to the graylog server on a service restart (or manual launch of the sidecar process), but I can see the traffic of the telnet request, so I know traffic can route and gets logged. In fact, I see no traffic going out to port 9000 at all when the service is restarted.
This definitely moves the issue firmly onto the sidecar itself, as the server is rightly reporting that it is not receiving data from the node.
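For future searchers, a capture along these lines is what confirms whether the sidecar ever reaches out (IP and port are from the error log earlier in the thread; the interface choice is an assumption):

```shell
# Watch for any sidecar traffic toward the Graylog API from the client node.
# No packets during a sidecar restart means the process never even tries
# to connect, pointing at the sidecar rather than the network.
GRAYLOG_IP="192.168.1.79"
FILTER="host $GRAYLOG_IP and tcp port 9000"
echo "run as root: tcpdump -ni any $FILTER"
```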
Here’s where I get stuck though.
The configuration is correctly pointed to “https://graylog.domain.com:9000/api”
Running wget correctly returns the Graylog web page:
# wget https://graylog.domain.com:9000/api
--2022-03-21 23:47:43-- https://graylog.domain.com:9000/api
Resolving graylog.domain.com (graylog.domain.com)... 192.168.1.79
Connecting to graylog.domain.com (graylog.domain.com)|192.168.1.79|:9000... connected.
HTTP request sent, awaiting response... 307 Temporary Redirect
Location: https://graylog.domain.com:9000/ [following]
--2022-03-21 23:47:43-- https://graylog.domain.com:9000/
Reusing existing connection to graylog.domain.com:9000.
HTTP request sent, awaiting response... 200 OK
Length: 1292 (1.3K) [text/html]
Saving to: ‘api’
api 100%[==============================================================================================================================>] 1.26K --.-KB/s in 0s
2022-03-21 23:47:43 (207 MB/s) - ‘api’ saved [1292/1292]
# cat api
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="robots" content="noindex, nofollow">
<title>Graylog Web Interface</title>
<link rel="shortcut icon" href="/assets/favicon.png">
So what would make the sidecar not even try to reach out to the server?
Thanks in advance. I feel like I’m missing something obvious at this point.
Well, at least we know it's the sidecar on the remote server.
To be honest, I'm not sure right now (I'll help look into it). If the sidecar is started but cannot connect to Graylog, then only a couple of things come to mind right off hand:
1. The port is being blocked, but I think you confirmed that it is not.
2. The node that the sidecar resides on doesn't know the FQDN of the Graylog server. Have you tried just the IP address, or adjusting the /etc/hosts file? Example:
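For instance, pinning the name locally (IP taken from the connection-refused log earlier in the thread) would look like:

```
# /etc/hosts on the sidecar node
192.168.1.79   graylog.domain.com
```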
3. I think you mentioned permissions were good on the sidecar.
4. Some evil Graylog magic that I don't know about.
I would look into all the logs under /var/log for anything that may pertain to the sidecar, or a service that may be interrupting this connection. I forget if there is a way to set the sidecar to debug/verbose logging; if so, that may help out a little more.
Here is an example of mine; it's for Windows, but even though I have an FQDN I used my IP address, and made sure the Graylog sidecar can access the binaries. Just a thought.
node_id: "file:C:\\Program Files\\Graylog\\sidecar\\node-id"
collector_binaries_accesslist:
  - "C:\\Program Files\\Graylog\\sidecar\\winlogbeat.exe"
  - "C:\\Program Files\\Metricbeat\\metricbeat.exe"
EDIT: just noticed something.
I'm not sure about AppArmor, but I do know SELinux: when changing states from enforcing to permissive, a reboot is required. Was that what you did?
It may not be the issue, but just to make sure there's nothing in the way.
The missing piece for me was that the sidecar process will not check into the graylog server until it has brought up the underlying beats processes (journalbeat and filebeat in my case).
When journalbeat starts, among other things it runs through the mounted file systems.
Both of these servers hold an NFS mount on our backend file server, but when the file server was patched and rebooted on Sunday morning, these two mounts failed to reattach themselves and got stuck (I'm still trying to figure out why, but the terminal locked whenever I issued commands that touched the stuck mount).
Performing a lazy unmount and remounting the shares freed up the disk operations, which then allowed journalbeat (and then filebeat) to start up normally.
TL;DR: Disk operations were timing out which prevented journalbeat from starting up correctly. The sidecar process was waiting on journalbeat to come up before checking in with the Graylog server.
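The check and recovery can be sketched as follows, assuming GNU coreutils (`timeout`, `stat`); the mount point and the 5-second timeout are illustrative:

```shell
# Probe each NFS mount with a timeout so a hung mount cannot wedge the
# shell; flag stuck ones for a lazy unmount and remount.
for m in ${MOUNTS:-/mnt/fsbackup}; do
  if timeout 5 stat -t "$m" >/dev/null 2>&1; then
    echo "OK:   $m"
  else
    echo "HUNG: $m  (recover with: umount -l $m && mount $m)"
  fi
done
```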
Thank you to both @tmacgbay and @gsmith for all the help in narrowing this down.
Well that’s an odd one - glad you found it!!! Mark your answer as a solution for future searchers!
Glad you resolved this issue.
Make it permanent after reboot:
root# vi /etc/fstab
/dev/sdb1 /mnt/my_repo ext4 defaults 0 0
Well, they are NFS mounts, so the solution was to remind fstab of that and add _netdev to their options. That should make things go a little smoother if this ever happens again.
192.168.1.106:/mail /mnt/fsbackup nfs defaults,_netdev,soft,vers=3 0 0