@rtanay as @Ponet pointed to a quite deep discussion about the usability of this release, it would be nice if you could share the experience that led you to roll back.
Absolutely, I’ll break it down into two categories:
Functionality
System load during search - This is possibly related to the new behavior where editing the search query triggers a new search, or to a lack of sufficient debounce on that behavior, but I’ve had a single user running a search for a single string across a single day’s worth of logs in a single stream pin the CPUs on all 6 of my nodes. I cannot think of a time when simply searching, across any range, had any noticeable impact on load prior to 3.2. A system setting to disable auto-searching on query changes would be a band-aid fix.
Processing lag - This is what prompted this topic about reverting. Yesterday around 3pm all 6 nodes simultaneously started processing incoming logs extremely slowly. The backlog per node maxed out around 3,500,000 unprocessed messages, resulting in a log lag of over 2 hours at its worst. While we’ve experienced log lag before under 3.1, that was only during massive log spikes. During this lag period, message volume was normal. Additionally, during previous lag/log spike incidents, all nodes’ CPUs were pinned as they worked overtime to process the backlog. During this event, while message processing fell further and further behind, load on all 6 nodes remained minimal. Load did not appreciably increase, nor did the backlog begin to see serious processing, until 6:40pm.
Usability
Missing ‘show surrounding messages’ option - No fewer than 5 of my teammates have asked me what happened to this feature post-upgrade. It was quite useful and is sorely missed.
Auto-searching on absolute date change - Trying to search for a specific period in the past brings the system to a crawl, as selecting the start date immediately starts a search while the end date is still the current date. The previous behavior, where you selected both the start and end of the period and had to manually press enter in the search bar to run the query again, avoided needlessly running the query against a huge chunk of logs. Again, being able to disable this in settings would be good.
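As a rough illustration of the debounce idea mentioned under system load, and how it would also apply to date changes, here is a minimal Python sketch. This is not Graylog's code: `Debouncer`, `trigger`, and the delay value are hypothetical names chosen for the example. It only shows the general technique of waiting for a quiet period before running an expensive action.

```python
import threading


class Debouncer:
    """Trailing-edge debounce: run `action` only after `wait` seconds
    have passed with no further calls to `trigger`.

    Hypothetical helper for illustration; not part of Graylog.
    """

    def __init__(self, action, wait):
        self._action = action
        self._wait = wait
        self._timer = None
        self._lock = threading.Lock()

    def trigger(self, *args):
        # Each new input cancels the pending run and restarts the countdown,
        # so only the last input in a burst actually reaches `action`.
        with self._lock:
            if self._timer is not None:
                self._timer.cancel()
            self._timer = threading.Timer(self._wait, self._action, args=args)
            self._timer.daemon = True
            self._timer.start()
```

With something like this in front of the search call, typing "e", "er", "err", "error" in quick succession would run one search for "error" rather than four, and picking a start date followed shortly by an end date would run one search over the intended range.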
Of these issues, the inexplicable log lag is the most troubling. Graylog has been an excellent tool up until this point, but several hours of “flying blind” without current logs for no apparent reason is unacceptable.