Thanks for sharing this info, but it does not really answer the questions I had. I made my research though and I will answer my own questions.
The problem I was asking about is how to find out if the message was delivered successfully to the cluster or not. Not sure if Java is a language you are comfortable with, but this implementation of TCP/UDP transports - https://github.com/Graylog2/gelfclient, takes part of your points into account. It supports an in-memory queue with messages and a thread pool reading the queue and sending to Graylog. If the producer is faster, eventually the queue will be full and a new message will not be accepted (trySend). If not accepted, the calling code will be informed. If there is a problem with the TCP channel or some timeout, the connection will be closed and it will try to reconnect. And this is where the problem is. A TCP packet will fail after a couple of unsuccessful re-transmissions. (On Windows, by default it is 3 re-transmissions, taking ~20 sec). Any subsequent message during that period will be converted to TCP packets and sent to the socket buffer. If the connection fails, all is lost and we have no idea what reached the Graylog servers and what not. If the protocol supported Acks from the Graylog server (not TCP acks, but GELF protocol ACKs), then we could know if the message was successful or not. Of course there are other points of failure, like my service with some messages on the queue crashes, but as you said we are trying to make the reliability as close to 100% as possible and make some trade-offs. If my Graylog cluster is scaled well, then at the moment of the crash I would lose the messages piled for around 1 sec. In the current scenario without acks, I can lose messages collected for the past 20 secs. That’s a lot.
So it was critical for me to understand what’s the implementation of the other protocols.
I reviewed quickly the code of the graylog server on GitHub and found the answers I was looking for.
- AMQP - https://goo.gl/LKMNrY - sends ack once the message is accepted and processing is scheduled
- HTTP (see the link above, one folder up; I am now allowed to post more than 2 links now) - returns status code 202 when the message is scheduled for processing
Processing includes retries and robust error handling, so if there is anything wrong, it will be reported in the error logs. If the server crashes apparently any messages scheduled for processing will be lost. The buffer with scheduled messages for processing is 65K elements by default. So depending on the speed of processing, up to this number of elements can be lost in case of a server crash. Assuming again that the cluster is a reasonable size and is not slower than producers, in case of a crash we lose the messages for a smaller period of time.
And one more time the difference with the direct TCP connection is that in the case of socket connection, we don’t know if the messages have reached the graylog server at all, so even in case of connectivity problems they can be lost. With the other two methods the problem is mostly if the server processing the messages goes down (assuming that the ES is HA).