Graylog as Aggregator and ETL tool

Hi all, is there a way to configure a stream not to send data to an ES index, and to use Graylog as an aggregator like fluentd?

We are doing a POC on sampling log data, and we want to send data to multiple destinations after processing, for example Splunk, CrateDB, ES, and S3. Currently, because every stream depends on the single underlying ES, we are seeing dropped data and huge latencies before logs show up in Splunk, as the journal fills up quickly. We tried increasing the journal buffers, but that adds even more latency before logs appear in Splunk or S3.

What are we trying to do?

In our setup, the data ingested directly into Graylog is roughly 70TB, but we only want to send around 4.5TB to Splunk via pipelines after processing, around 10TB to ES, and all 70TB to S3.
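To put those volumes in proportion, here is a quick back-of-the-envelope sketch in Python. It uses only the figures quoted above; the names are made up for illustration, and since the post does not state what time period the volumes cover, only relative shares are computed.

```python
# Volumes quoted in the post, in TB. The time period is not stated there,
# so we only compute ratios, not rates.
INGEST_TB = 70.0
TARGET_TB = {"splunk": 4.5, "es": 10.0, "s3": 70.0}


def share_of_ingest(target_tb, ingest_tb=INGEST_TB):
    """Fraction of the ingested volume that a destination should receive."""
    return target_tb / ingest_tb


shares = {name: share_of_ingest(tb) for name, tb in TARGET_TB.items()}

# Total volume written out across all destinations, relative to ingest.
fan_out_factor = sum(TARGET_TB.values()) / INGEST_TB

for name, share in shares.items():
    print(f"{name}: {share:.1%} of ingest")
print(f"total output is {fan_out_factor:.2f}x the ingested volume")
```

In other words, ES is only meant to see about one seventh of the traffic, which is why routing everything through it first becomes the bottleneck.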

Current Behavior

Graylog is able to buffer all the incoming logs, but most of the data is getting dropped because we have dedicated only 10TB of resources to ES per DC.

Below is detailed description of our POC setup:

We are running Graylog in three DCs. Our current setup per DC, with heavy TCP tuning:

Graylog (7 masters, 33 agents), default journal buffers, running as containers on k8s

Mongo (HA 4 Nodes)

Elasticsearch (10 nodes: 1 master, 1 balancer, 8 data nodes)

Instance details: all nodes are running on I-Flavor (designed for data workloads) with 16 CPUs and 64GB RAM

Operating system: CentOS 7.2

Willing to contribute back if it's a viable solution:
We are in the process of implementing a custom stream plugin and need guidance or suggestions on whether we can use Graylog as an aggregator and ETL tool without the single ES dependency. Is implementing a custom stream plugin the right approach, or are there other alternatives?

Here is an open issue I raised, which also relates to ES federation and using Graylog as an aggregator.


[Screenshot: Current Throughput Per DC]

So many questions :slight_smile: This setup is waaaay over my head, so I’m pretty much not-very-helpful here.

I just have to ask: what is “I-Flavor”? Googling for it with some terms added to the query doesn’t give me useful results.

“Just forward”: no, that is not possible, but I think there could be a workaround.
Create a new index set with small, size-based indices. If the stream uses this index set, Graylog will still store the messages, but only for a few minutes or hours. If you also configure ES not to index the contents of that index set, not to keep replicas, and so on, it could gain some performance for this forwarding index set.
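As a rough illustration of that workaround, here is a small Python sketch. All numbers and names in it are hypothetical, and it assumes on-disk index size roughly tracks ingested bytes, which is not exactly true in practice. It estimates how long messages would survive in a small, size-rotated index set, and shows an ES index settings body with replicas turned off and a slower refresh (`number_of_replicas` and `refresh_interval` are standard ES index settings; whether they help enough here is untested).

```python
import json

GIB = 1024 ** 3
MIB = 1024 ** 2


def retention_window_seconds(max_index_bytes, max_indices, ingest_bytes_per_sec):
    """Approximate time a message stays stored before rotation deletes it.

    Assumes size-based rotation that keeps `max_indices` indices of at most
    `max_index_bytes` each, and a steady ingest rate.
    """
    if ingest_bytes_per_sec <= 0:
        raise ValueError("ingest rate must be positive")
    return (max_index_bytes * max_indices) / ingest_bytes_per_sec


# Hypothetical example: 1 GiB indices, keep 5 of them, 50 MiB/s into the set.
window = retention_window_seconds(1 * GIB, 5, 50 * MIB)
print(f"messages kept for roughly {window / 60:.1f} minutes")

# Settings body one might apply to the forwarding indices (an assumption,
# not a tested Graylog recipe): no replicas, less frequent refresh.
forward_index_settings = {
    "index": {
        "number_of_replicas": 0,
        "refresh_interval": "30s",
    }
}
print(json.dumps(forward_index_settings, indent=2))
```

The point of the arithmetic is that the index set only has to be big enough to cover the time the outputs need to pick the messages up, not the full retention you want in Splunk or S3.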

You can also increase the output_batch_size parameter and the ES HTTP max request size. Change the two together, because mismatched sizes can cause problems.
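To show why the two sizes need to move together, here is a minimal Python sketch. It assumes the "ES HTTP max size" means the `http.max_content_length` setting, and the average message size and per-message overhead are invented numbers, so treat the result as an order-of-magnitude check only.

```python
MIB = 1024 ** 2


def bulk_request_bytes(output_batch_size, avg_message_bytes, overhead_bytes=100):
    """Rough size of one bulk request: messages plus per-message bulk overhead."""
    return output_batch_size * (avg_message_bytes + overhead_bytes)


def max_batch_size(http_max_content_bytes, avg_message_bytes, overhead_bytes=100):
    """Largest batch whose estimated bulk request still fits under the HTTP limit."""
    return http_max_content_bytes // (avg_message_bytes + overhead_bytes)


# Hypothetical inputs: 2 KiB average message, 100 MiB HTTP request limit.
avg_message = 2048
http_limit = 100 * MIB

for batch in (500, 5000, 50000):
    size = bulk_request_bytes(batch, avg_message)
    verdict = "fits" if size <= http_limit else "exceeds the HTTP limit"
    print(f"output_batch_size={batch}: ~{size / MIB:.1f} MiB, {verdict}")

print("largest batch that fits:", max_batch_size(http_limit, avg_message))
```

If the estimated request size gets close to the limit, either lower the batch size or raise the HTTP limit; leave headroom, since real messages vary a lot in size.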

If you have time: we are collecting information about big clusters. From what I see, you handle a lot of data, so you could probably tell us something new.
If you would like, we can ask an admin to reopen that topic.


I-family instances are used for data-intensive workloads. More details can be found here: Storage Optimized Instances

Ahhh! I get it now! K versus I instances on AWS :slight_smile: And here I thought there was a whole IBM architecture that I had somehow missed out on :smiley:

Thank you!
