1. Describe your incident:
While securing Graylog with TLS, the graylog-server service does not start, and the server log shows the error “Unreadable or missing HTTP private key”.
2. Describe your environment:
We’re running Graylog 6.1 and OpenSearch 2.15 in a two-server configuration on a security-hardened Oracle Linux 8 OS. The certificate/key was issued by a Windows-based CA.
3. What steps have you already taken to try and solve the problem?
We checked the file location/ownership/permissions per the instructions to confirm that the service user can find/access the file.
We tried re-encoding the private key file on the Graylog server itself (using openssl) to rule out any MS-DOS/Windows character-encoding issues.
We looked for anything in the security hardening configuration that might prevent the service from reading the file.
4. How can the community help?
How can we resolve the “Unreadable or missing HTTP private key” error? Is there something in how the certificate/key is generated that might be causing the error?
Howdy! My initial thought is that the graylog-server service cannot access the file (which is what the error says, I think). Typically this is caused by a permission issue.
I assume your graylog-server service runs as the default graylog user? You should be able to test whether this user can read the file by doing something like:
# Use sudo to change user context to graylog
sudo -u graylog -s
cat /path/to/private.key
Can you confirm the following:
The certs exist in a path that the user graylog has permission to access
The file itself is owned by the user graylog
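One quick way to check both of those at once is to walk the whole path, since the graylog user needs execute permission on every parent directory, not just read permission on the key itself (the path below is a placeholder, same as above — substitute your actual key location):

```shell
# Show owner/permissions for every component of the path;
# /path/to/private.key is a placeholder - substitute your key's location
namei -l /path/to/private.key

# Directly test readability as the graylog service user
sudo -u graylog test -r /path/to/private.key && echo "readable" || echo "NOT readable"
```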
I’m not that familiar with security-hardened Oracle Linux 8 configurations, but do you know if there is anything that prevents services or users from accessing files? Does this hardening have any sort of logging?
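Since this is a hardened EL8 box, one specific thing worth ruling out is SELinux: a key file copied from somewhere else can carry the wrong security context, and the resulting denials show up only in the audit log, not in Graylog’s own log. A sketch, assuming auditd and the standard SELinux tools are installed (the key path is a placeholder):

```shell
# Is SELinux enforcing?
getenforce

# Look for recent access denials (AVCs) involving graylog
sudo ausearch -m avc -ts recent 2>/dev/null | grep -i graylog

# Files moved (not created) into a directory keep their old context;
# restorecon resets the file to the default context for its current path
sudo restorecon -v /path/to/private.key
```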
Lastly, are you able to post the full line or lines from Graylog’s server.log with the error message?
Well that was kinda helpful but now we’ve got a bigger problem in that we’re completely down and can’t figure out how to fix it.
After learning that our graylog-server user could not actually access the cert and key file, we copied those files to a location where they could be accessed, modified ownership/permissions of both files, backed up our non-TLS server.conf, copied into play our TLS server.conf, and restarted the service. We were able to achieve an HTTPS login, but with warnings that the cert was untrusted because it was self-signed (the cert was not self-signed; it was generated by our trusted CA).
We then reverted to our non-TLS server.conf and restarted the service, but found that we could no longer log in. When we press the Sign In button, a red banner appears titled “Could not load streams”, which further says “Loading streams failed with status: FetchError: There was an error fetching a resource: Unauthorized. Additional information: Not available”. We are then unable to progress beyond the welcome page. We rebooted both the Graylog and OpenSearch servers, but the red banner persists.
We’re trying to undo every single change made to restore the service, but nothing is working.
We restored a previous backup of the Graylog server and can now log in using the non-TLS configuration. However, one of our major streams is now not reaching OpenSearch, and there is a repeated ERROR entry in the Graylog server log:
[MessagesAdapterOS2] Failed to index [1] messages. Please check the index error log in your web interface for the reason. Error: failure in bulk execution
There is another minor stream which is working/searching just fine. We tried rebooting both servers but the ERROR and issue with the major stream persists.
We have now cleared all errors/warnings in the Graylog and OpenSearch logs, but the major stream is still not storing messages in OpenSearch, despite its pipeline and output showing a great deal of activity. We’re stumped and dead in the water.
It’s difficult to know how to help without knowing what changes were made.
Can you clarify what this means?
the major stream is still not storing messages on OpenSearch
What stream? Are any messages being output to OpenSearch?
If you truly need to get this working, I recommend starting from scratch and documenting the changes you make so you can understand how to roll them back in the event they cause issues. I also recommend testing changes in a test cluster or test environment, which can be as small as a single 2 vCPU / 4 GB RAM server.
We’re running two different streams on separate inputs/indexes. The minor one is about .5MB/day and the major one is about 10GB/day. The pipelines for both streams are structured similarly, though the minor stream uses only regex matching to parse fields while the major stream uses grok patterns that we created.
Currently, the minor stream is working fine and its logs are being stored/searched on OpenSearch. The major stream is currently not writing logs to OpenSearch, despite its pipeline showing normal throughput figures and the Graylog page header showing typical output figures.
There were periodic indexer failures happening when a certain sparsely occurring log subtype would come through, which we cleared by removing the pipeline rule which groks that particular log subtype.
So we’re currently seeing nothing abnormal in the Graylog or OpenSearch logs and nothing abnormal about any of the throughput/output in Graylog, but zero logs from the major stream are being written to OpenSearch. Logs previously written from this stream are searchable, but nothing new is being stored and we can’t figure out why.
This is a Proof-of-Concept system so that we can show its usefulness and gain approval to build a much larger system. We plan to keep this PoC system as a test environment when (if) we build the larger system. We lost track of changes when everything went haywire trying to make HTTPS work. We assumed keeping daily backups and separate copies of the TLS and non-TLS server.conf files would be enough to revert if something went wrong, but both of those were attempted here with mixed, unexpected results.
If I understand, you have messages being ingested by Graylog (sent to an input), but you cannot find those messages anywhere in Graylog, correct?
To help give you some concepts for troubleshooting, and to track down what is happening, message flow works like this:
Log/message source sends logs to a Graylog Input
Graylog Input listens on a specified port and receives input sent by source
Messages are passed from the input into Graylog’s processors, which can include pipeline rules and extractors
Using either Stream Rules or Pipeline Routing rules, messages are routed to specific streams as well. The stream dictates what index set the message will be stored in
Once messages are processed they are sent to Graylog’s output buffer, where they are written in batches to OpenSearch
This is what we typically refer to as a message being “indexed” meaning it has been written to OpenSearch and is then searchable by Graylog
Once messages are output to OpenSearch, Graylog can retrieve these messages by executing searching against OpenSearch
Here is also a visual overview. It was originally made to explain various bottlenecks, but it can be useful for understanding how messages flow through Graylog.
Can we validate that the intended log message is being sent to the Graylog input?
We can use something like tcpdump:
# Replace the port number as needed
sudo tcpdump -i any -nA port 5555
Verify the metrics for the applicable input are >0 and continually increase
Verify you can see messages when clicking “Show received messages” for the applicable input
IF the message is a syslog message being sent to a Graylog Syslog input, validate that the timezone on the device sending the log message and the Graylog input are configured to the same value. Otherwise messages will have an incorrect timestamp.
IF you expect the message to be routed to the default stream, validate that there are no stream rules or pipeline rules routing messages to a different stream (screenshots here may be helpful).
Can we verify that
some or any messages are being output to OpenSearch
messages are searchable via Graylog
Verify there are no indexing errors via the System / Overview page, the “Indexing & Processing Failures” section.
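It may also help to ask OpenSearch directly whether documents are arriving, bypassing Graylog’s UI entirely. A sketch, assuming OpenSearch listens on its default port 9200 and the index sets use the default graylog_ prefix (adjust host, port, and prefix to your setup):

```shell
# List indices with document counts; run it twice a minute apart and
# compare docs.count for the index set the major stream writes to
curl -s 'http://localhost:9200/_cat/indices/graylog_*?v&s=index'

# Cluster health should be green or yellow; red means shards are down
curl -s 'http://localhost:9200/_cluster/health?pretty'
```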
What I’m getting at is whether this is a problem with the messages ending up in the wrong stream, ending up with the wrong timestamp, or not being written to OpenSearch at all.
For the minor stream (.5MB/day), everything is working fine and new logs are searchable. For the major stream (10GB/day), we cannot see any new messages via search (only older ones from before the HTTPS attempt), but we can clearly see that the throughput readings for that stream’s input and for its pipeline rules (and the “in/out” gauge at the top of every Graylog page) are all lit up with high activity. So we can tell that Graylog is crunching away at these logs for the major stream and is apparently outputting them, but we see no trace of them in search.
The Default Stream is empty as it should be in our configuration. There were a handful of indexer errors on the Overview page, which we remedied by removing the pipeline rule that was causing them and they have since ceased. Since there are no errors or signs of trouble in the Graylog server log or in the OpenSearch cluster log, we are at a complete loss for how to troubleshoot this.
If you do an all-time search filtered to the current write index that the stream feeds into, do any further results get returned? Use below format to search per index.
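If the in-app search stays empty, you can also pull the newest document straight out of the write index with the OpenSearch search API. A sketch — graylog_5 is a made-up example index name; substitute the current write index shown under System / Indices:

```shell
# Fetch the single most recent document from the (example) write index;
# an empty "hits" array means nothing new is being written, no matter
# what Graylog's throughput gauges show
curl -s 'http://localhost:9200/graylog_5/_search?size=1' \
  -H 'Content-Type: application/json' \
  -d '{"sort":[{"timestamp":{"order":"desc"}}]}'
```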
I notice your search page is specifically filtering for a stream. If you remove the stream and have the search page show messages from all streams does that make any difference? Are there any identifying fields in the message you could search for when searching all streams (leaving the stream selection box empty)?
So I came in this morning after touching some grass over the weekend with an idea to try creating an alternate index for routing the logs, only to find that the problem has mysteriously disappeared and the logs from the major stream are now being indexed again.
As for the original problem with the private key, we resolved that earlier by moving the cert and key files to a different filesystem that could be accessed by the graylog-server user. It’s likely that something in the security hardening settings was preventing access in the original location.
To address the issue of the cert not being trusted, we will re-issue the cert in a different way and try again.
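Before re-issuing, it might be worth checking locally whether the browser warning is really about a self-signed cert or just a missing intermediate in the chain, since the two look similar from the client side. The file names and hostname below are placeholders for your own:

```shell
# Verify the server cert against the CA bundle (root plus any
# intermediates) that issued it
openssl verify -CAfile ca.pem cert.pem

# If issuer and subject are identical, the cert really is self-signed;
# if they differ, the client is probably just missing the chain
openssl x509 -in cert.pem -noout -issuer -subject

# Show the chain a live server actually presents
openssl s_client -connect graylog.example.com:443 -showcerts </dev/null
```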