Sidecar not starting wrappers

We seem to have an issue with both the latest version of sidecar and the original (1.1.0 and 1.0.1).

The sidecar wrappers are not starting automatically after a client reboot. The clients are Windows 10 21H2. When you look in services after a client boots the Graylog Sidecar service is running but all the beats wrappers are set to Manual but haven’t been started by the Sidecar service. The Graylog sidecar management interface shows all the wrappers as green arrows (running).

In the test which results in the below log the machine was booted at around 9:40 and a user logs in. There have been no changes to the configuration files for Beats in the Sidecar configuration. At 9:50 I triggered a manual restart of the Metricbeat configuration process through the Graylog interface.

The resulting logs are:

time="2022-09-16T09:40:21+01:00" level=info msg="Starting signal distributor" 
time="2022-09-16T09:40:32+01:00" level=info msg="Adding process runner for: filebeat" 
time="2022-09-16T09:40:32+01:00" level=info msg="Adding process runner for: metricbeat" 
time="2022-09-16T09:40:32+01:00" level=info msg="Adding process runner for: winlogbeat_client" 
time="2022-09-16T09:40:32+01:00" level=info msg="[winlogbeat_client] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:40:32+01:00" level=info msg="[filebeat] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:40:32+01:00" level=info msg="[metricbeat] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:45:36+01:00" level=info msg="Starting signal distributor" 
time="2022-09-16T09:45:46+01:00" level=info msg="Adding process runner for: metricbeat" 
time="2022-09-16T09:45:46+01:00" level=info msg="Adding process runner for: winlogbeat_client" 
time="2022-09-16T09:45:46+01:00" level=info msg="Adding process runner for: filebeat" 
time="2022-09-16T09:45:46+01:00" level=info msg="[metricbeat] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:45:46+01:00" level=info msg="[winlogbeat_client] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:45:46+01:00" level=info msg="[filebeat] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:49:31+01:00" level=info msg="Stopping signal distributor" 
time="2022-09-16T09:49:31+01:00" level=info msg="[metricbeat] Stopping" 
time="2022-09-16T09:49:31+01:00" level=info msg="[filebeat] Stopping" 
time="2022-09-16T09:49:31+01:00" level=info msg="[winlogbeat_client] Stopping" 
time="2022-09-16T09:49:31+01:00" level=error msg="[metricbeat] Could not send stop control: The service has not been started." 
time="2022-09-16T09:49:31+01:00" level=error msg="[filebeat] Could not send stop control: The service has not been started." 
time="2022-09-16T09:49:31+01:00" level=error msg="[winlogbeat_client] Could not send stop control: The service has not been started." 
time="2022-09-16T09:49:33+01:00" level=info msg="Starting signal distributor" 
time="2022-09-16T09:49:43+01:00" level=info msg="Adding process runner for: metricbeat" 
time="2022-09-16T09:49:43+01:00" level=info msg="Adding process runner for: winlogbeat_client" 
time="2022-09-16T09:49:43+01:00" level=info msg="Adding process runner for: filebeat" 
time="2022-09-16T09:49:44+01:00" level=info msg="[metricbeat] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:49:44+01:00" level=info msg="[winlogbeat_client] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:49:44+01:00" level=info msg="[filebeat] Configuration change detected, rewriting configuration file." 
time="2022-09-16T09:50:14+01:00" level=info msg="[metricbeat] Got remote restart command" 
time="2022-09-16T09:50:14+01:00" level=info msg="[metricbeat] Starting (svc driver)" 

Restarting the individual configuration processes through the Graylog interface kicks things into life and logs begin to flow (09:50).

I think it may be to do with TLS, our Graylog instance is HTTPS but our Beats input and Beats agents are not configured for TLS. Could this cause this issue?

Can you post your sidecar.yml for the Win10 machine (Obfuscated and default commenting not needed of course) below are mine on my win10 machine just for comparison (NOTE: yml is case/space sensitive)

You can run C:\Program Files\Graylog\sidecar\graylog-sidecar.exe -debug at the administrative command line to get a view into what its doing - maybe post that as well.

server_url: http://billion-core:9000/api/
server_api_token: "<random-stuff>" 
update_interval: 10
tls_skip_verify: true
send_status: true
list_log_files:
collector_id: file:C:\Program Files\Graylog\sidecar\collector-id
cache_path: C:\Program Files\Graylog\sidecar\cache
log_path: C:\Program Files\Graylog\sidecar\logs
log_rotation_time: 86400
log_max_age: 604800
tags: [windows]
collector_binaries_accesslist: []
backends:
    - name: nxlog
      enabled: false
      binary_path: C:\Program Files (x86)\nxlog\nxlog.exe
      configuration_path: C:\Program Files\Graylog\sidecar\generated\nxlog.conf
    - name: winlogbeat
      enabled: true
      binary_path: C:\Program Files\Graylog\sidecar\winlogbeat.exe
      configuration_path: C:\Program Files\Graylog\sidecar\generated\winlogbeat.yml
    - name: filebeat
      enabled: true
      binary_path: C:\Program Files\Graylog\sidecar\filebeat.exe
      configuration_path: C:\Program Files\Graylog\sidecar\generated\filebeat.yml
    - name: auditbeat
      enabled: false
      binary_path: C:\Program Files\Graylog\sidecar\auditbeat.exe
      configuration_path: C:\Program Files\Graylog\sidecar\generated\auditbeat.yml

Hello @nick

Adding on to @tmacgbay Post.
By chance does you MetricBeat service look something like this?

image

The services aren’t listed in services.msc until I manually push a restart/start from Graylog. When they appear they are listed as Manual (which I think is correct as they’re controlled by the Graylog Sidecar service). The Graylog Sidecar service is always running and set to Automatic.

The Beats configs are correct, they seem to be running ok on Server 2019 boxes (although we’re just running through checking they are actually running) and when the services are manually restarted the logs appear in Graylog. I think this rules out the Beats configs as being an issue.

In the sidecar log on the clients I can see an attempt to rewrite the config every few minutes and a lot of errors about Sidecar not being able to report status to Graylog - connection actively refused. That would explain the Graylog status page showing the service as running when it’s not (although I’d expect Graylog to show an unknown status for the service at that point). It doesn’t explain why the collector services aren’t being started by the Sidecar service though.

The Beats input is running on port 5044 and is not configured for TLS, the Graylog web interface is configured for TLS.

sidecar debug shows that the application has detected a config change but the config files don’t seem to have been downloaded.

It also shows “Failed to get registered services, skipping cleanup, access denied”.

I’m wondering whether the graylog-sidecar services requires some permissions its missing to read and set service states?

Hello,
By chance did you execute these commands here on Windows?

"C:\Program Files\graylog\sidecar\graylog-sidecar.exe" -service install
"C:\Program Files\graylog\sidecar\graylog-sidecar.exe" -service start

And if you have errors stating the following…

Sidecar not being able to report status to Graylog -

“Failed to get registered services, skipping cleanup, access denied”

I think the Graylog Sidecar wasn’t install correctly or may have missed some settings/configurations.

Thanks @gsmith

All installed according to the documentation, we have sidecar installed on our server infrastructure as well and that doesn’t seem to have an issue.

As far as I’m aware there are no configurations to miss. We install our custom sidecar.yml (which matches the server ones) after the installation and the sidecar appears in the Graylog interface.

I have now noticed that Graylog fails to push new configs down as well, it definitely looks like a permissions issue to start with. The only change we’ve made since installing sidecar a couple of years ago is the transition from Windows 10 2004 to 21H2, I wonder if the new Windows version broke something silently.

It’s a bit annoying that Graylog is showing all the services as running when they’re not. If a sidecar doesn’t report status surely they should report as Unknown?

Ok, I’m working on this issue again today. For background, our installation process for Graylog sidecar is an xcopy of the Graylog folder and then a powershell command to do the -service install. We then reboot during the build process so starting the service isn’t required (as it starts automatically on reboot). There are no custom permissions set on the Graylog folder structure. The installation is carried out by the local administrator account within Microsoft Deployment Toolkit.

There appear to be two related issues found when running the debug tool. One is Access Denied for sidecar to get Service statuses and the other is Access Denied when trying to write new log files.

Here are the three errors that are in constant repeat within the debug log:

level=debug msg="[RequestConfiguration] No Configuration update available, skipping update" Failed to fire hook = Can't rename log file: rename C:\Program Files\Graylog\sidecar\logs\sidecar.log C:\Program Files\Graylog\sidecar\logs\sidecar-2022-09-22T09-00-26.650.log

level=warning msg="Failed to get registered services. Skipping Cleanup. Access is denied." Failed to fire hook = Can't rename log file: rename C:\Program Files\Graylog\sidecar\logs\sidecar.log C:\Program Files\Graylog\sidecar\logs\sidecar-2022-09-22T09-00-27.827.log

level=debug msg="[RequestBackendList] No update available" Failed to fire hook = Can't rename log file: rename C:\Program Files\Graylog\sidecar\logs\sidecar.log C:\Program Files\Graylog\sidecar\logs\sidecar-2022-09-22T09-00-28.534.log

Hello

I would not be surprised, dealing with Windows.

level=debug msg="[RequestConfiguration] No Configuration update available, skipping update" Failed to fire hook = Can't rename log file: rename C:\Program Files\Graylog\sidecar\logs\sidecar.log C:\Program Files\Graylog\sidecar\logs\sidecar-2022-09-22T09-00-26.650.log

So permission issue for renaming the log file :thinking:

By chance on Graylog-Sidecar conf file is there any configuration made like this?

# Default:
collector_binaries_accesslist:
#  - "C:\\Program Files\\Graylog\\sidecar\\filebeat.exe"
   - "C:\\Program Files\\Graylog\\sidecar\\winlogbeat.exe"
#  - "C:\\Program Files\\Filebeat\\filebeat.exe"
#  - "C:\\Program Files\\Packetbeat\\packetbeat.exe"
   - "C:\\Program Files\\Metricbeat\\metricbeat.exe"

Check the file/folder privileges… that seems odd that it can’t rotate a log file. You set the service account to start with an account with better privileges’ as a test as well…

Can you post the sidecar.yml that you are using? (Obfuscated)

1 Like

Permissions on the local folder seem correct. System has full access and services are running as Local System.

sidecar.yml

server_url: "https://graylogserver:9000/api"
server_api_token: "token"
node_id: "file:C:\\Program Files\\Graylog\\sidecar\\node-id"
node_name: ""
update_interval: 10
tls_skip_verify: false
send_status: true

collector_binaries_whitelist:
- "C:\\Program Files\\Graylog\\sidecar\\beats\\filebeat\\filebeat.exe"
- "C:\\Program Files\\Graylog\\sidecar\\beats\\metricbeat\\metricbeat.exe"
- "C:\\Program Files\\Graylog\\sidecar\\beats\\winlogbeat\\winlogbeat.exe"

metricbeat.conf (as an example)

fields_under_root: true
fields.collector_node_id: ${sidecar.nodeName}
fields.gl2_source_collector: ${sidecar.nodeId}

queue:
  mem:
    events: 4096
    flush.min_events: 0
    flush.timeout: 1s

#----------------------------- Logstash output ---------------------------------
output.logstash:
  # Boolean flag to enable or disable the output module.
  enabled: true

  # The Logstash hosts
  hosts: ["${user.grayloghostname}:5044"]

#----------------------------------- Modules -----------------------------------

metricbeat.modules:

#-------------------------------- System Module --------------------------------
- module: system
  metricsets:
    - cpu             # CPU usage
    #- load            # CPU load averages
    - memory          # Memory usage
    - network         # Network IO
    #- process         # Per process metrics
    #- process_summary # Process summary
    - uptime          # System Uptime
    #- socket_summary  # Socket summary
    #- core           # Per CPU core usage
    - diskio         # Disk IO
    - filesystem     # File system usage for each mountpoint
    #- fsstat         # File system summary metrics
    #- raid           # Raid
    #- socket         # Sockets and connection info (linux only)
    #- service        # systemd service information
  enabled: true
  period: 10s
  processes: ['.*']

  # Configure the metric types that are included by these metricsets.
  cpu.metrics:  ["percentages","normalized_percentages"]  # The other available option is ticks.
  core.metrics: ["percentages"]  # The other available option is ticks.

All variables, server urls and tokens are correct (but obfuscated in the above examples).

I’m also seeing fairly regular (though not constant) errors in the log like this:

[UpdateRegistration]Failed to report collector status to server: put \"https://graylogserver:9000/api/sidecars/nodeid\": dial tcp lookup graylogserver :no such host"

Our DNS entries are correct and the graylog server responds to ping from the clients having issues.

I’ve updated the clients to sidecar 1.2.0 as there were a couple of fixes that looked like they may cause excess log issues but my key issues remain.

  • Beats services aren’t started automatically by the sidecar service (though I can remotely start the service through the Graylog interface).
  • Beats config file updates are not being distributed successfully automatically or manually from the Graylog interface.
  • Log files don’t seem to be correctly writing, the individual beats logfiles within the logs folder aren’t created. Sidecar.log is updated but the logs show errors attempting to rename it by the sidecar service.

When config files are manually applied and the services are manually started log files are sent and appear in Graylog.

Oh and it’s worth saying I installed a brand new Graylog instance (latest version) and a Windows 10 2004 client with Sidecar v1.2.0.

Everything works fine but the error:

level=warning msg="Failed to get registered services. Skipping Cleanup. Access is denied."

is still there every 10 seconds or so in the logs. This looks like a bug/default behavior of sidecar (so may not be the source of the issue).

Certainly not normal behavior… Not sure if it is with your copy/paste on posting but your sidecar.yml does not have the correct indentation. yml files are very specific about indentation and syntax. It is also not in the default location (at least not historically…)

you have posted:

collector_binaries_whitelist:
- "C:\\Program Files\\Graylog\\sidecar\\beats\\filebeat\\filebeat.exe"
...

when it should look like this with indentation and default location:

collector_binaries_accesslist:
   - "C:\\Program Files\\Graylog\\sidecar\\filebeat.exe"
...

You likely can move it’s location but you need to make sure the “backends” ( you have a backends section, right?!?) section matches that path and possibly some other things (I haven’t moved it). I may be that that is the new path for beats though…

1 Like

Hi @tmacgbay, This is the default location in our environment as the latest beats packages are large folders rather than just an .exe. The location of the .exe’s does match the .yml.

I don’t indent the binaries list, it seems to work fine. This configuration is used across our estate and is working fine in all but this subsection.

I don’t think I do have a backends section (unless that’s a term for something else)? I’m using Graylog 3.3.16 if that makes a difference? Remember that in the main, within our environment most logs are reaching Graylog with no problems. I only have a subset of machines which are having problems.

The Failed to get registered services message seems to happen on default installs so it would be useful for a Graylog staff member to comment on this, I’ll raise a bug on Github if it seems unusual or if someone could replicate?

I have had winlog beats working on win10 for some time now, most likely as far back as the 3.x days so that shouldn’t make a difference. Not sure what you mean by “Failed to get registered services message seems to happen on default installs” Is that with your default configured sidecar.yml file that they fail? Graylog Staff members occasionally peruse and may comment, but it is rare on the community side. It doesn’t seem like it would be a bug to post to Github if others are able to get it to work…

If you have win10 firewall on (or something third party), you may need to poke holes in that. Try some experimentation with sidecar.yml on the machines that don’t work and some of the configuration difference posted above. Not sure where the sidecar logs are set to go in your configurations (I don’t see that part posted) but examining the sidecar logs usually gives enough detail to figure out what is going on. Since you have made some changes, you can post sidecar.log results if you like.

Hello @nick

Not sure but I would look here incase the version of graylog you have may not support the version of Graylog sidecar that was installed.

1 Like

@tmacgbay I’ve had beats working with Windows clients too, over the last few years, this seems to be a recent issue.

To explain the Failed to get registered services error. I have installed a fresh copy of Graylog and a fresh Windows 10 client (new downloads, not from our media library), both fully up to date. With default settings and default sidecar.yml the debug log from Sidecar shows the “Failed to get registered services” error. This points towards a bug rather than a configuration issue I’d have thought. Logs are still passed and nothing is entered into the sidecar.log file. Unless you have debug on for Sidecar you’ll probably never notice the issue.

@gsmith Graylog 3.3.16 supports Sidecar 1.2.0 (1.2.x | 3.2.5 or higher) according to the documentation.

The main problem is that we thought our clients were working correctly, a lot of our clients/servers are fine so Graylog is still receiving logs (around 350,000 a day). The Graylog Sidecar interface is also showing all clients have Running processes and applied configurations. It’s only when troubleshooting a different client issue we found logs for that client were missing in Graylog. It’s then we found that not all our Sidecar instances are working correctly (collector services not running). As far as you can tell in Graylog everything looks ok. In this specific failure mode it seems that Graylog believes the Sidecar instances are working and the services are running (even though they’re not). I would expect the Sidecar status in Graylog to default to Unknown if it doesn’t receive the services status from the sidecar client but it seems to stay on the last received status (in our case Running).

Have you tried setting up a test sidecar.yml that mirrors the one I posted but has settings for your environment?

You can post in Github for Sidecar but it would be a challenge to get the developers to consider a 3.x version of Graylog for issues there. Where do you stand on your upgrade process?

Yes, same issues, it’s definitely not a .yml config or .conf config error. We have the same configurations working in other parts of the environment.

The Sidecar “Failed to get registered services error” is on the latest version of Graylog 4 with the latest Sidecar v 1.2.0 as well.

We need to review the license changes from 3.x.x to 4.x.x and upgrading to OpenSearch isn’t a small undertaking!

Looking back in my log I had a similar message once No other detail other than the machine was short on memory (Mine usually is but thats another story) I will post it up for posterity. Di you look at Win10 firewall? Throwing out to see what sticks.

time="2022-08-26T13:11:27-04:00" level=info msg="[winlogbeat] Configuration change detected, rewriting configuration file." 
time="2022-08-26T13:11:28-04:00" level=info msg="[winlogbeat] Starting (svc driver)" 
time="2022-09-01T00:43:28-04:00" level=warning msg="Failed to get registered services. Skipping cleanup. Not enough memory resources are available to complete this operation." 
time="2022-09-08T17:11:08-04:00" level=info msg="Starting signal distributor" 
time="2022-09-08T17:11:18-04:00" level=info msg="Adding process runner for: winfilebeat"