Hi,
I have more than four Graylog (GL) nodes and an Elasticsearch (ES) cluster.
The problem is that I have more than 100 active streams. In the beginning I had only a few. As I added more streams and pushed more logs into the system, I saw it needed more CPU power, so I added more Graylog nodes. Now I am at the point where adding a new GL node does not add much value, because every node has to evaluate the regexes of those 100+ streams for every message.
In the beginning each node was processing over 10,000 msg/sec; now each one processes ~1,000 msg/sec (and it's not because of Elasticsearch: when I paused all streams but one, the ingestion rate went back up).
Is there an architectural approach for this?
I was thinking of using a common Elasticsearch cluster with two or more Graylog clusters that are independent of each other.
Each Graylog cluster would use the same MongoDB replica set but, of course, a different database, and each would use its own set of Elasticsearch indices. In front of them I would put Apache as a reverse proxy pointing to the Graylog clusters (e.g. www.site.org/gl1/ for GL cluster 1 and www.site.org/gl2/ for GL cluster 2), as in the sketch below.
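A minimal sketch of what that Apache front end could look like, assuming mod_proxy/mod_proxy_http are enabled and one entry point per cluster. The backend host names are placeholders, and depending on your Graylog version you also have to tell each cluster its externally visible URI (e.g. web_endpoint_uri or http_external_uri in server.conf) so the web interface works behind the path prefix:

```
<VirtualHost *:80>
    ServerName www.site.org

    # gl1-node1 / gl2-node1 are hypothetical names for one node
    # (or a load-balanced address) in each Graylog cluster
    ProxyPass        /gl1/ http://gl1-node1.internal:9000/
    ProxyPassReverse /g1l/ http://gl1-node1.internal:9000/

    ProxyPass        /gl2/ http://gl2-node1.internal:9000/
    ProxyPassReverse /gl2/ http://gl2-node1.internal:9000/
</VirtualHost>
```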
Not an architecture idea, but: the first thing I would try is optimizing the regexes, or using plain string matching instead of a regex in the stream conditions. This is sometimes very difficult, but it can make a huge difference.
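To illustrate why (a toy benchmark in Python; Graylog evaluates rules with Java regexes on the JVM, so absolute numbers differ, but the relative effect is the same): a catch-all regex does far more work per message than a plain substring check.

```python
import re
import timeit

line = "2017-06-12T10:15:00 app42 INFO user login from 10.0.0.7"

# a typical "lazy" stream rule: wildcards wrapped around a keyword
pattern = re.compile(r".*ERROR.*")

t_regex = timeit.timeit(lambda: pattern.search(line), number=100_000)
t_plain = timeit.timeit(lambda: "ERROR" in line, number=100_000)

print(f"regex search:    {t_regex:.3f}s")
print(f"plain substring: {t_plain:.3f}s")
```

Multiply that per-message difference by 100+ streams and thousands of messages per second, and it adds up quickly.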
In the short term, optimizing the stream rules is an option; in the long term it is not. I was wondering whether Graylog could stop at the first matching stream rule and not continue checking all the remaining rules. Maybe a stream could have a flag that tells Graylog to stop checking the remaining rules, and incoming messages would be checked against these "flagged" streams' rules first. Is that fantasy, or not?
That's already the case in some situations: e.g. if all rules of a stream have to match and the first rule doesn't match, Graylog will stop evaluating the remaining rules for that stream.
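A simplified sketch of what that short-circuit means (an illustration, not Graylog's actual code): lazy evaluation stops checking a stream's rules as soon as the outcome is decided.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stream:
    rules: List[Callable[[str], bool]]  # each rule tests one message
    matching_type: str = "AND"          # "all rules" vs "at least one"

def message_matches_stream(message: str, stream: Stream) -> bool:
    if stream.matching_type == "AND":
        # all() is lazy: stops at the first failing rule, so later
        # (expensive) rules are never evaluated for this message
        return all(rule(message) for rule in stream.rules)
    # "at least one must match": any() stops at the first success
    return any(rule(message) for rule in stream.rules)

def route(message: str, streams: List[Stream]) -> List[Stream]:
    # every message is still tested against *every* stream; the
    # short-circuit only applies within one stream's rule list
    return [s for s in streams if message_matches_stream(message, s)]
```

So per-stream evaluation already bails out early; what the original idea asks for is a short-circuit across streams, which is harder because a message can legitimately belong to several streams at once.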
Scaling has worked very well for us, but I agree with the others here: if stream rules are hurting you, "tightening up" your regexes to be more efficient, or switching to plain string matching, pays HUGE dividends. Once we took the time to do it, we were able to get by with fewer nodes doing the same work. Another thing that helped enormously was pre-formatting our outgoing logs from servers/devices/applications as GELF/JSON (see the sketch below). Not needing CPU-intensive regex extractors also makes a very noticeable difference in Graylog performance and required compute resources. Just these few things go a long way when very high msg/sec inputs are in play.
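A minimal sketch of that pre-formatting approach, assuming a GELF UDP input listening on the default port 12201 (the host name and the extra fields are hypothetical): the fields arrive pre-structured, so no extractor has to run regexes on the Graylog side.

```python
import json
import socket
import time

GRAYLOG_HOST, GELF_PORT = "graylog.example.org", 12201  # placeholder address

def send_gelf(short_message: str, **extra) -> None:
    msg = {
        "version": "1.1",
        "host": socket.gethostname(),
        "short_message": short_message,
        "timestamp": time.time(),
        "level": 6,  # syslog severity: informational
    }
    # GELF custom fields must be prefixed with an underscore
    msg.update({f"_{k}": v for k, v in extra.items()})
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(json.dumps(msg).encode("utf-8"), (GRAYLOG_HOST, GELF_PORT))

send_gelf("user login", user="alice", app="webshop", duration_ms=42)
```

Each field lands in Elasticsearch as its own searchable field without any per-message regex work.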