Something I’m keen to understand is the difference between outgoing traffic and the size of messages in Elasticsearch.
I’ve been trying out adding a new service to Graylog and this has significantly increased the outgoing traffic.
When I export all the log messages that this new service has pushed to Graylog, the total message size is in the region of 80 MB. However, this appears to correlate to approximately 20 GB of additional outgoing traffic over the same period.
We also calculated the message size for ALL messages over a given period. The message size was 3 GB, but total outgoing traffic for the same period was around 14 GB.
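To put those two observations side by side, here is a quick back-of-the-envelope calculation of the ratio between outgoing traffic and exported message size. It uses only the figures quoted above and assumes decimal units (1 GB = 1000 MB), which may not match how either number was actually measured.

```python
# Figures quoted in the post; decimal units are an assumption.
MB = 10**6
GB = 10**9

def amplification(message_bytes, outgoing_bytes):
    """How many times larger the outgoing traffic is than the message size."""
    return outgoing_bytes / message_bytes

new_service = amplification(80 * MB, 20 * GB)   # new service only
all_messages = amplification(3 * GB, 14 * GB)   # all messages in the period

print(f"new service:  {new_service:.0f}x")
print(f"all messages: {all_messages:.1f}x")
```

The new service is off by a far larger factor than the overall traffic, which suggests that whatever is inflating the numbers affects that service's messages in particular.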
There is only one Index Set configured, and all streams are set to use it.
On investigation, I came across this post:
It contains the cryptic line "be warned, that the size of the messages in Elasticsearch is not equal to the outgoing traffic in Graylog (on the System page)."
Two questions:
Any thoughts on what might explain the discrepancy we are seeing?
Can anyone elaborate on the difference between message size and outgoing traffic?
We had read this, but it doesn’t actually explain why there would be a discrepancy between the size of the message coming into Graylog and the size of the message going out to Elasticsearch, which is what this question is asking.
The link you provided defines outgoing traffic as: “what is written to Elasticsearch after all processing is done”
We have already confirmed there is only one Index Set and all streams use this.
This raises the question: what additional processing might take place to enrich messages between GL and ES?
Messages are ingested into Graylog as one single string. That string is then processed in several ways:
by the codec of the input (Syslog, GELF, CEF or any other)
extractors that are defined for the input
processing pipelines for the stream the messages are in
The codec first separates the message into different fields; extractors or pipelines can then split those into more specific fields or enrich the data from external sources. That would include geo data for IPs, DNS lookups, or external databases/systems that are queried via REST APIs or a simple CSV lookup.
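As a rough illustration of how those three stages can make a message grow, here is a minimal Python sketch. It is not how Graylog accounts for outgoing traffic internally; the sample log line, the field names, and the enrichment values are all made up, and size is simply measured as the UTF-8 length of the compact JSON document.

```python
import json

def byte_size(value):
    """UTF-8 size in bytes; dicts are measured as compact JSON."""
    text = value if isinstance(value, str) else json.dumps(value, separators=(",", ":"))
    return len(text.encode("utf-8"))

# Raw line as it might arrive on a syslog input (made-up example).
raw = "<34>Oct 11 22:14:15 web01 sshd[4721]: Failed password for root from 203.0.113.7 port 22"

# 1. Codec: first separation into basic fields (hypothetical field names).
doc = {"message": raw, "source": "web01", "facility": "auth", "level": 2}
sizes = {"raw": byte_size(raw), "after codec": byte_size(doc)}

# 2. Extractors: more specific fields pulled out of the message text.
doc.update({"application": "sshd", "pid": 4721, "username": "root", "src_ip": "203.0.113.7"})
sizes["after extractors"] = byte_size(doc)

# 3. Pipeline: enrichment from lookups (geo, DNS, CSV); values are placeholders.
doc.update({
    "src_ip_geo_country": "ZZ",
    "src_ip_geo_city": "Example City",
    "src_ip_reverse_dns": "host.example.net",
    "asset_owner": "ops-team",
})
sizes["after pipeline"] = byte_size(doc)

for stage, size in sizes.items():
    print(f"{stage}: {size} bytes")
```

Because the original string is kept in `message` and every extracted or looked-up value is stored again under its own field name, each stage adds bytes on top of the raw input.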
It could also be that information from the original message is removed or anonymized. I have seen environments where the original message was removed completely and only a newly generated message, containing some information from the original message plus information from a lookup, is saved in Elasticsearch.
In my own environment I have some noisy messages that I can’t remove at the source, because the 3rd party vendor is either chatty or gives no information. So I drop the chatty messages and keep only the important ones. That means I remove 19 messages out of 20 to get the one important message. The drop is done in Graylog, so only one message is saved to Elasticsearch; all the others are gone.
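The effect of that kind of drop can be simulated in a few lines of Python. This is only a sketch of the idea, not a Graylog pipeline rule; the message texts and the `is_important` check are hypothetical.

```python
def is_important(message):
    # Hypothetical rule: keep only messages that report an error.
    return "ERROR" in message

# 19 chatty messages plus one that matters (made-up content).
incoming = [f"heartbeat ok seq={i}" for i in range(19)]
incoming.append("ERROR disk failure on /dev/sda")

# Everything that fails the check is dropped before it would be stored.
stored = [m for m in incoming if is_important(m)]

print(f"received {len(incoming)}, stored {len(stored)}")
```

In this case what is stored is smaller than what was received, which is the opposite direction from enrichment.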
Thanks @jan - this is all really helpful and gives some good pointers. So one form of processing would be to filter out fields/messages so we only have the useful stuff, but in this case we would expect Input > Output.
However, we are finding that the messages are getting substantially bigger, not smaller, at the point of storage.
So as an example, we looked at messages over a weekend and found that we generated 3 GB of messages going into GL, but this ended up generating 14 GB of outgoing traffic.
We see consistently that the volume of outgoing traffic is substantially larger than the size on disk of the input messages.