Graylog journal continuously growing


(Mihail Politaev) #1

Our Graylog instance receives a fairly modest message rate, around 1000-2000 msg/s, yet the journal keeps growing until I get an alert in the web interface that some uncommitted messages were deleted from the journal and that journal utilization is too high.

Can you advise which parameters should be tweaked to handle this load? Also, is there a way to determine that Graylog is under high load, other than noticing that the journal keeps growing and never shrinks?

I should also mention that the ES cluster has 3 nodes: 1 is a witness-only master and the other 2 are data nodes. Before that we had a 1-node cluster. I created the master, as well as the other node, from an AMI of that single node. Some indices are now relocating between the 2 data nodes. Could that be why Elasticsearch can't handle messages as fast as we send them?

But I still don't think that is the cause, because CPU load is under 100%: about 90% on one data node and under 50% on the other.
Thank you.


(Jan Doberstein) #2

The Graylog journal will grow if Elasticsearch is not able to handle the load, or if processing in Graylog takes so much time that real-time processing is no longer possible.
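To make the growth mechanism concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from Graylog itself: the function name, the journal size, and the average message size are all assumptions chosen for illustration. It only estimates how long a journal lasts when messages arrive faster than they are written out.

```python
def seconds_until_full(journal_bytes, used_fraction, in_rate, out_rate, avg_msg_bytes):
    """Estimate the seconds until the journal is full.

    journal_bytes  -- configured maximum journal size (assumed value)
    used_fraction  -- current utilization, 0.0 to 1.0
    in_rate        -- messages per second arriving at the inputs
    out_rate       -- messages per second actually written to Elasticsearch
    avg_msg_bytes  -- average size of one journaled message (assumed value)

    Returns None when the journal is not growing.
    """
    net_rate = in_rate - out_rate  # messages per second piling up in the journal
    if net_rate <= 0:
        return None
    free_bytes = journal_bytes * (1.0 - used_fraction)
    return free_bytes / (net_rate * avg_msg_bytes)


if __name__ == "__main__":
    # Hypothetical numbers: 5 GiB journal, 1 KiB messages, 2000 msg/s in, 1500 msg/s out.
    eta = seconds_until_full(5 * 1024**3, 0.0, 2000, 1500, 1024)
    print(f"journal full in about {eta / 3600:.1f} hours")
```

With these made-up numbers an empty journal fills in roughly 10486 seconds, a little under three hours. The point is only that any sustained gap between input and output rate will eventually fill the journal, however large it is.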

You should check your Graylog log files to find the reason for that.
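As one possible way to do that, the sketch below scans log lines for the KafkaJournal utilization warnings and estimates how fast utilization is rising. The regular expression is modeled only on the WARN lines quoted in this thread, so it may need adjusting for other Graylog versions. The function names are made up for this example, and the timestamp parsing assumes Python 3.7 or newer.

```python
import re
from datetime import datetime

# Pattern modeled on the KafkaJournal WARN lines quoted in this thread.
JOURNAL_RE = re.compile(
    r"^(?P<ts>\S+)\s+WARN\s+\[KafkaJournal\] Journal utilization \((?P<pct>[\d.]+)%\)"
)


def parse_journal_warnings(lines):
    """Return a list of (timestamp, utilization_percent) tuples."""
    samples = []
    for line in lines:
        match = JOURNAL_RE.match(line)
        if match:
            ts = datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f%z")
            samples.append((ts, float(match.group("pct"))))
    return samples


def growth_per_minute(samples):
    """Percentage points gained per minute between the first and last sample."""
    if len(samples) < 2:
        return None
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    minutes = (t1 - t0).total_seconds() / 60.0
    if minutes <= 0:
        return None
    return (p1 - p0) / minutes


if __name__ == "__main__":
    log_lines = [
        "2017-07-24T02:56:03.305-05:00 WARN  [KafkaJournal] Journal utilization (96.0%) has gone over 95%.",
        "2017-07-24T02:57:03.305-05:00 WARN  [KafkaJournal] Journal utilization (96.0%) has gone over 95%.",
        "2017-07-24T02:58:03.305-05:00 WARN  [KafkaJournal] Journal utilization (97.0%) has gone over 95%.",
    ]
    print(growth_per_minute(parse_journal_warnings(log_lines)))
```

For the three sample lines this gives 0.5 percentage points per minute. These warnings only appear once utilization is already above 95%, so this is a late signal, not an early one.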


(Mihail Politaev) #3
2017-07-24T02:56:03.305-05:00 WARN  [KafkaJournal] Journal utilization (96.0%) has gone over 95%.
2017-07-24T02:57:03.305-05:00 WARN  [KafkaJournal] Journal utilization (96.0%) has gone over 95%.
2017-07-24T02:58:03.305-05:00 WARN  [KafkaJournal] Journal utilization (97.0%) has gone over 95%.
2017-07-24T03:45:43.508-05:00 ERROR [ServerRuntime$Responder] An I/O error has occurred while writing a response message entity to the container output stream.
org.glassfish.jersey.server.internal.process.MappableException: java.io.IOException: Connection closed
        at org.glassfish.jersey.server.internal.MappableExceptionWrapperInterceptor.aroundWriteTo(MappableExceptionWrapperInterceptor.java:92) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.WriterInterceptorExecutor.proceed(WriterInterceptorExecutor.java:162) ~[graylog.jar:?]
        at org.glassfish.jersey.message.internal.MessageBodyFactory.writeTo(MessageBodyFactory.java:1130) ~[graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$Responder.writeResponse(ServerRuntime.java:711) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$Responder.processResponse(ServerRuntime.java:444) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$Responder.process(ServerRuntime.java:434) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:329) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:315) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:297) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:267) [graylog.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) [graylog.jar:?]
        at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) [graylog.jar:?]
        at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:384) [graylog.jar:?]
        at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:224) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: java.io.IOException: Connection closed

This is what I see in the Graylog logs. What does "container output stream" actually mean? I looked at the I/O monitoring for the Graylog node and it is not under load, but the Elasticsearch node is. Does this mean that the ES node doesn't have enough I/O throughput?

I also see these parameters:
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true

I couldn't find where these values are explained in the Graylog documentation. I also couldn't find how Graylog actually writes messages into Elasticsearch, not even in the Graylog deep dive slideshow. My guess is that the path is Graylog -> journal -> ES nodes. Am I right?

Thank you.


(Jochen) #4

(Mihail Politaev) #5

Thank you. Now it is clear.

Another problem has come up. When I am already logged in to the web interface, everything works fine and fast enough. But when I try to log in from a private browser session, or another user tries to log in, the web interface page takes a very long time to load, 1 or 2 minutes.
Do you have any idea why that is?

At the time of the problem I see this error in the Graylog log:

An I/O error has occurred while writing a response message entity to the container output stream.


(Scampuza) #6

We are facing the same issue in our company. When the journal keeps growing, our current solution is brute force: we add more cores and more RAM to the GL nodes, and then change the following parameters to match the number of CPU cores available on the server. This workaround has worked for us.

processbuffer_processors = N cores
outputbuffer_processors = N cores
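
For illustration only, a small Python sketch that prints those two lines for the machine it runs on. It mirrors the workaround described in this post and is not an official sizing rule. The function name is made up, and the output is meant to be copied by hand into the Graylog server configuration file.

```python
import os


def buffer_processor_settings(cores=None):
    """Render the two settings from this post for a given core count.

    When cores is None, the count reported by the operating system is used.
    """
    if cores is None:
        cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return [
        f"processbuffer_processors = {cores}",
        f"outputbuffer_processors = {cores}",
    ]


if __name__ == "__main__":
    for line in buffer_processor_settings():
        print(line)
```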


(Mihail Politaev) #7

Strange, because the CPU load average log doesn't show the CPU under load, not even above 50%.


#8

If Graylog server CPU is low, but messages are not processed quickly, it is possible that the Elasticsearch cluster is not quick enough.

You can try to speed things up by:

  • setting output_batch_size to a larger value (for example 5000)
  • adding RAM to the Elasticsearch servers, and setting the Elasticsearch JVM heap size to half of the new amount of RAM of these servers
  • if you have such high CPU utilizations (about 50%) on ES nodes, you probably have too little memory on those nodes.
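
A rough Python sketch of the arithmetic behind the first two suggestions. The function names are made up for this example. The batch estimate ignores output_flush_interval and the number of output processors, so treat it as an order-of-magnitude figure. The 31 GB heap cap is an assumption added here, not something stated in this thread; it follows the common guidance to keep the Elasticsearch heap below roughly 32 GB.

```python
def bulk_requests_per_second(msg_rate, batch_size):
    """Approximate number of bulk writes per second when every batch is full."""
    return msg_rate / batch_size


def es_heap_mb(ram_mb, cap_mb=31 * 1024):
    """Half of the server RAM in MB, capped at cap_mb (assumed cap, see above)."""
    return min(ram_mb // 2, cap_mb)


if __name__ == "__main__":
    rate = 2000  # msg/s, the upper end of the rate mentioned in the first post
    for batch in (500, 5000):
        print(f"output_batch_size = {batch}: "
              f"about {bulk_requests_per_second(rate, batch):.1f} bulk requests/s")
    print(f"heap for a 16 GB server: {es_heap_mb(16 * 1024)} MB")
```

At 2000 msg/s, a batch size of 500 works out to about 4 bulk requests per second, while 5000 works out to about 0.4. A 16 GB server would get an 8192 MB heap.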
