What’s the most reliable way to send messages to Graylog with minimal or no loss?
GELF does not support acks, so in case of a network failure, all messages sent during the time it takes to detect the connection problem (~20 s on Windows) will be lost. Is there anything I am missing?
How about the REST API? Does a status code of 200-204 mean the message was successfully stored? How many HTTP log requests can a single Graylog instance handle per second? (I’m trying to compare the performance penalty of HTTP vs. TCP.)
Is logging via AMQP reliable? Does it use manual acks or not? If yes, then when: before or after the message is sent to Elasticsearch? In a word, is it worth sending messages through AMQP for the sake of reliability, or is it no better than GELF over TCP? And lastly, if it is, do you have a link to a good example of using it with Graylog?
If nothing is logged for some time while one of the Graylog instances is down, will the load balancer detect this before any message is sent and thus prevent message loss? Is there any difference between the TCP and HTTP cases?
So what’s the recommended way of logging to avoid losing entries when there are connectivity issues or servers go down?
I want to ingest from the app. Messages are generated on certain events in a number of services and sent to Graylog. You can think of it as a normal logger (like any of the implementations of https://www.slf4j.org/), but it’s important to know whether a message was lost or stored successfully.
Regarding the mixing of transport and structure, I think the protocol is important here. So it’s not a question of TCP vs. UDP; it’s whether, based on the higher-level protocol, we can know that a message was successfully delivered. The options I listed are basically: a direct TCP socket connection using GELF; HTTP (the Graylog REST API, which is request/response); and AMQP with a durable queue as a buffer, provided Graylog acks messages when reading them from the queue (if not, it’s the same as the direct TCP socket connection).
Let me know if more details are needed.
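For concreteness, the first option (GELF over a plain TCP socket) boils down to writing a JSON document followed by a single null byte. A minimal client-side sketch of the framing (field values are made up, and there is no JSON escaping; a real client would use a JSON library and write the frame to the socket’s output stream):

```java
import java.nio.charset.StandardCharsets;

public class GelfTcpFrame {

    // Build a minimal GELF 1.1 payload; the TCP framing is the JSON
    // document followed by a single null byte as delimiter.
    static byte[] frame(String host, String shortMessage) {
        String json = "{\"version\":\"1.1\","
                + "\"host\":\"" + host + "\","
                + "\"short_message\":\"" + shortMessage + "\"}";
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        byte[] framed = new byte[body.length + 1];
        System.arraycopy(body, 0, framed, 0, body.length);
        framed[body.length] = 0; // null-byte delimiter required by GELF over TCP
        return framed;
    }

    public static void main(String[] args) {
        byte[] f = frame("app-01", "user login");
        // Last byte is the delimiter; everything before it is the JSON payload.
        System.out.println(f.length + " " + f[f.length - 1]);
    }
}
```

Note that writing this frame to a socket only hands the bytes to the OS; nothing in the protocol tells the client whether Graylog actually consumed them, which is exactly the ack question above.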
First, keep in mind that 100% reliability is not possible, but you can try to get as close to 100% as you can.
Depending on your application design, keep in mind that logging synchronously via TCP to Graylog can make your application block while you do maintenance on Graylog (including the storage layer, i.e. Elasticsearch) or when any other issue occurs. So even if you load-balance your Graylog nodes, you will have problems if Elasticsearch goes mad.
It might be better to use a queue to offload messages from your application and have Graylog read from it. That adds another moving part to your setup that needs monitoring and proper maintenance, i.e. attention from people; you also need the knowledge to build it securely and highly available.
Then I would use GELF as the codec for your messages, because they will be structured and no crazy regex is needed to make the data available for search, widgets, or alerting. You then only need to enrich the data, if you need that at all.
Once you know your data, you should also add Elasticsearch mappings for the known fields to avoid issues with auto-detected field types.
That’s all from my end. If you need assistance with this, or with the planning and structure, I’ll be happy to help if you book professional services for it, as that is what keeps the open-source Graylog in the game.
Thanks for sharing this info, but it does not really answer the questions I had. I did my own research, though, and will answer my own questions.
The problem I was asking about is how to find out whether a message was delivered successfully to the cluster or not. Not sure if Java is a language you are comfortable with, but this implementation of the TCP/UDP transports, https://github.com/Graylog2/gelfclient, takes some of your points into account. It maintains an in-memory queue of messages and a thread pool that reads from the queue and sends to Graylog. If the producer is faster, the queue will eventually fill up and a new message will not be accepted (trySend); if it is not accepted, the calling code is informed. If there is a problem with the TCP channel, or a timeout, the connection is closed and the client tries to reconnect.

And this is where the problem is. A TCP send fails only after a couple of unsuccessful retransmissions (on Windows the default is 3 retransmissions, taking ~20 s). Any message produced during that period is converted into TCP packets and written to the socket buffer; if the connection then fails, all of it is lost, and we have no idea what reached the Graylog servers and what did not. If the protocol supported acks from the Graylog server (not TCP acks, but GELF protocol acks), we could know whether each message succeeded.

Of course there are other points of failure, e.g. my service crashing with messages still on the queue, but as you said, we are trying to bring reliability as close to 100% as possible while making some trade-offs. If my Graylog cluster is scaled well, then at the moment of a crash I would lose about one second’s worth of queued messages. In the current scenario without acks, I can lose the messages collected over the past 20 seconds. That’s a lot.
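The trySend behaviour can be reduced to a plain bounded queue: `offer` returns false instead of blocking when the queue is full, so the caller finds out immediately that the message was not accepted (the capacity of 2 here is artificially small just to show the rejection; gelfclient’s real implementation is more involved):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class TrySendDemo {
    public static void main(String[] args) {
        // A bounded in-memory queue between producing threads and a
        // sender thread, as in gelfclient.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(2);

        // offer() is the essence of trySend: it never blocks, it just
        // reports whether the message was accepted.
        System.out.println(queue.offer("msg-1")); // true
        System.out.println(queue.offer("msg-2")); // true
        System.out.println(queue.offer("msg-3")); // false: queue full, caller is informed
    }
}
```

The crucial point is that this only protects against a slow sender; once a message has left the queue and been written to the socket, the client has no further feedback.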
So it was critical for me to understand how the other protocols are implemented.
I quickly reviewed the code of the Graylog server on GitHub and found the answers I was looking for.
HTTP (see the link above, one folder up; I am not allowed to post more than 2 links) returns status code 202 when the message is scheduled for processing.
Processing includes retries and robust error handling, so if anything goes wrong, it will be reported in the error logs. If the server crashes, however, any messages scheduled for processing are apparently lost. The buffer of messages scheduled for processing holds 65K elements by default, so depending on processing speed, up to that many messages can be lost in a server crash. Assuming again that the cluster is reasonably sized and not slower than the producers, a crash loses only the messages from a short period of time.
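To sketch what the client sees in this case (the server below is a stand-in I wrote for this example, not Graylog itself): the 202 only confirms scheduling, not storage.

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class GelfHttpAck {

    // Posts one GELF message to the given endpoint and returns the status code.
    static int postGelf(String endpoint) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(endpoint))
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"version\":\"1.1\",\"host\":\"app\",\"short_message\":\"hi\"}"))
                .build();
        return HttpClient.newHttpClient()
                .send(req, HttpResponse.BodyHandlers.discarding())
                .statusCode();
    }

    static int demo() throws Exception {
        // Stand-in for the GELF HTTP input: it answers 202 as soon as the
        // message is queued for processing, i.e. BEFORE it is stored.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/gelf", exchange -> {
            exchange.getRequestBody().readAllBytes();
            exchange.sendResponseHeaders(202, -1); // "Accepted", not "stored"
            exchange.close();
        });
        server.start();
        try {
            return postGelf("http://localhost:" + server.getAddress().getPort() + "/gelf");
        } finally {
            server.stop(0);
        }
    }

    public static void main(String[] args) throws Exception {
        // 202 only proves the message reached the server process; it can
        // still be lost from the journal if the node crashes afterwards.
        System.out.println(demo()); // prints 202
    }
}
```

Unlike the plain socket case, though, the client at least knows the message reached the server process before it discards its own copy.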
And once more, the difference from the direct TCP connection is that with a plain socket we don’t know whether the messages reached the Graylog server at all, so they can be lost even on mere connectivity problems. With the other two methods, the risk is mostly the server that is processing the messages going down (assuming Elasticsearch is HA).
Graylog uses a journal and asynchronous message processing, so I think it is not possible to send back an ack about storage. The journal can overflow, and processing can drop messages (e.g. a pipeline message drop).
If you want to prepare for network problems, put a cache between the app and the network. Usually apps log to a file; Filebeat, for example, can ship the file to Graylog. It can handle network problems because it marks the last sent file position.
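To illustrate the file-as-buffer idea, a minimal Filebeat sketch might look like this (the paths and host are placeholders, and Graylog would need a matching Beats input listening on that port):

```yaml
filebeat.inputs:
  - type: log
    paths:
      - /var/log/myapp/*.log   # the app writes here first; the file is the durable buffer

output.logstash:
  # Graylog's Beats input speaks the same lumberjack protocol as Logstash
  hosts: ["graylog.example.com:5044"]
```

Filebeat records the last acknowledged offset in its registry, so after a network outage it resumes from where it left off instead of losing what was in flight.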
I did some stress tests before: I sent messages from one host to a GL load balancer, and I did not see any message loss over TCP (as long as the journal was big enough to absorb the message storm…). The number of messages found by search matched the number sent.
You are right about the overflow, but in that case you can send a nack, and that is enough for the consumer to know something went wrong. Or, since GL does retries, you could even send an ack, indicating that GL did its best but hit a more serious problem. In the HTTP and direct-TCP scenarios, this ack could carry additional info showing whether the operation succeeded. With a message queue, we would have to either live with plain ack/nack or add a notification mechanism to tell interested parties that something went wrong.
If I need a buffer, I would go with the message queue. Filebeat is a good point, but for me it adds complexity compared to the queue. But yeah, thanks for adding this option. For completeness: it behaves the same as HTTP and AMQP, sending an ack once the message is scheduled for processing.
As an end note, I think that support for acks at the protocol level for direct TCP communication would be very beneficial. It could be optional and driven by client choice during the handshake, but in any case it would prevent the issues caused by connectivity problems and give the same level of reliability as AMQP and HTTP. It’s interesting what happens in your experiment if you put one of your GL servers down: until your LB detects it is down, I guess, all messages sent to that instance during this time window will be lost.
GELF is not a transport protocol; GELF is a data format. Different layers of the OSI stack.
You are right for the overflow, but in this case you can send nack and this will be enough for the consumer to know that something went wrong
The problem is that, from your point of view, you will already have received an ACK, because the message was accepted by Graylog. Graylog then plonks it into its own local journal, where it can still be lost! You will not be notified of that with a subsequent NACK, and a NACK wouldn’t do anything anyway, because the message will already have been purged on your side because of the ACK.
It’s interesting what happens in your experiment if you put one of your GL servers down. Until your LB detects it is down, I guess, all messages sent to that instance (that went down) during this time window will be lost.
That depends on the method of submission. If the load balancer points your traffic at a specific port and the port cannot be reached, then the message will not have been submitted and will thus be retained.
I have my Excel table now…
I sent 20-30 million messages per 10 minutes from one host over 100 parallel TCP sessions. No messages lost.
Unfortunately, I think you don’t understand Graylog’s message processing.
GL can send an ack about the arrival of the message, as I wrote.
So if you need an ack about message storage, you need another piece of software (not a data format).
If you use a cache on the client side, use TCP, configure Graylog well, and have enough journal/Graylog cluster capacity for processing, it won’t drop any messages (except on a node error).
I am not saying that it is a transport protocol (i.e. layer 4 of the OSI stack); I am talking about layer 7 only. If you had an ack on layer 7, your application would know exactly when the message was received completely. Similar to HTTP, where you have responses: if GL sent some response (just ack/nack or whatever) over the direct socket connection, then your app would know when GL has received the message, and losing messages that were already written to the socket would be prevented. As far as I can see, GL’s TcpTransport class only reads and writes nothing back to the client.
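What I mean by a layer-7 ack can be sketched with plain sockets. The receiver here is hypothetical (again, Graylog’s TcpTransport does not do this): it simply confirms each line it has consumed, and the sender treats a message as delivered only once it has read that confirmation.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class Layer7AckDemo {

    // Sends one message and blocks until the server's application-level
    // ack arrives; only then does the caller treat the message as delivered.
    static String sendWithAck(String message) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) {
            // Hypothetical ack-aware receiver: confirm every line consumed.
            Thread receiver = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    while (in.readLine() != null) {
                        out.println("ACK"); // layer-7 ack, not a TCP ack
                    }
                } catch (Exception ignored) {
                }
            });
            receiver.start();

            try (Socket client = new Socket("localhost", server.getLocalPort());
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println(message);
                // A TCP-level ack would only mean the bytes reached the peer's
                // socket buffer; this read proves the application consumed them.
                return in.readLine();
            } finally {
                receiver.join();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sendWithAck("{\"short_message\":\"hello\"}")); // prints ACK
    }
}
```

With a scheme like this, a crash between the write and the ack leaves the sender knowing exactly which messages are unconfirmed and can be retried.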
Actually, the idea here is that we could send the ack/nack only when the message has been processed from the journal; that is, we would not ack merely on accepting the message.
Of course, I don’t know the details of how the journal is processed; I’ve only quickly looked over the GL code. If the journal is a message queue or similar, and it’s not guaranteed which GL instance will process a message from it, then sending the ack on the same channel would be a problem. What I suggested makes sense if the message is processed by the same instance that accepted it.
Hmm, maybe I’m missing something here, so please correct me if I am wrong, but suppose we use direct TCP socket communication like https://github.com/Graylog2/gelfclient. If the LB cannot reach the port, won’t it close the TCP connection? If it does, everything sent to the local socket buffer will be lost. Am I missing anything?
@macko003, a few clarifying comments to be sure we are on the same page.
That load is awesome, but my concern is rather about a GL instance going down for some reason. See my last comment in the answer to @Totally_Not_A_Robot. My understanding is that if we use a direct TCP connection - https://bit.ly/2Ukjxf9 - and the LB goes down, the TCP connection will be closed. However, that takes a while, and everything sent from the moment it went down until the problem is detected will be lost. Am I missing anything? (HTTP and AMQP are fine.)
That’s right for AMQP, Filebeat, and HTTP, but I don’t think it’s correct for the direct TCP communication I described above (the link posted above). TCP retransmission is not enough to guarantee successful delivery of a message if you don’t have acks at the application level, and I didn’t find any for this case.
Using Filebeat will work because it uses acks - https://bit.ly/2sSGxX4 . With direct TCP communication over a socket I don’t see any acks at the application level, and this is the difference. Journaling comes into play when we have a lot of messages; I am discussing only a small load now, but with failing GL nodes. See my first comment. I think we are talking about different things.
only a lot of things…
An LB/GL node goes down? Who cares? You need HA.
TCP has to handle these problems. And one more thing: a TCP connection doesn’t live forever. Sometimes the sides close it and open a new one, so this happens often in the background, and that layer can handle it.
If an LB/GL node goes down, you won’t get a TCP ack, so the client will wait a few seconds, retry, and reconnect.
I don’t think a correct application drops the message after the first problem.
Don’t mix the transport and protocol layers! You are talking about two different things.
A TCP ack only means the recipient got the PACKET, not your MESSAGE. (OK, usually a syslog message fits in one packet.)
It depends… I use a very big journal (100 GB) to ride out a weekend-long Elasticsearch problem (that was the original plan; by now it can cover about half a day). In the normal case, though, it holds only a few seconds’ worth of messages.
You need to know your needs and your bursts, and size the journal accordingly.
// I prefer full links and don’t open the short ones.
6:00:00 - a connection is opened, the HA redirects it to GL1
6:01:00 - GL1 goes down, the HA does not know about it yet
6:01:00 - 6:01:10 - the client sends messages, the HA sends them to GL1 or GL2
6:01:10 - the HA detects that GL1 is down; the TCP connection to GL1 is closed (either by the proxy or by the client a few moments later)
6:02:00 - GL1 is up again, so the client can establish a new connection
The result: all the messages redirected by the HA to GL1 between 6:01:00 and 6:01:10 are lost.
So I can’t agree with what you say. The client detects the problem 20 seconds after it occurred. During that interval, messages are buffered in the local socket output buffer before the connection is re-established, so they are lost.
You can experiment with it.
I am not mixing the protocols at all. I have tried to explain a couple of times that the problem is at the application layer. TCP is not as reliable as people tend to say; it is reliable only in a certain sense. If data is stuck in the output socket buffers and the connection breaks (see above), the data is lost.
As for journaling, I agree, but the problems I started this discussion with are independent of it.
P.S. I used shortened URLs because the original ones are too long, but I take the note and will use full ones. https://unshorten.it/ can be used to check whether a URL is safe.
I was thinking of an HA LB cluster; a single LB is a single point of failure.
I don’t know how you do it, but it should not lose any data; the client has to handle this.
Maybe with a lower timeout and more frequent checks you can decrease this time. But I have no experience with HAProxy; I switched away from it after one day.
// You can check the short links and open them if the domain isn’t rejected by your company’s DNS.
LB HA is fine, but unfortunately load balancing TCP traffic is trickier. Decreasing the timeout for detecting inactive instances definitely helps, but it does not completely solve the problem. That’s why I said an ack at the application layer would be very useful.
Anyway, thanks for your help; I have a much clearer picture now. There is stuff that could be added to the Graylog server, but I guess there are priorities.
P.S. Thanks for pointing out the problem with the short URLs. It didn’t occur to me that the domain could be rejected.