Incoming is good but outgoing is very slow


(Tafsir) #1

Hi @jan

I can see on some of my data nodes that the incoming message rate is good but the outgoing rate is very slow. Because of this, a backlog is building up on my data node.

Could you please help me understand why this is happening?


(Jan Doberstein) #2

My guess is that you need more resources on Elasticsearch to keep up with the ingested messages.


(Tafsir) #3

Which type of resources, @jan?


(Jan Doberstein) #4

Just more computing power for the Elasticsearch cluster - it is not fast enough to process and store the messages. Without any additional information this can only be a guess.


#5

Check your Elasticsearch server(s) for bottlenecks.
First the disk, then the CPU.
Usually it either can't write out the data or can't process it fast enough.
You can also check the memory usage - not just the OS memory, but the Elasticsearch heap usage too.
Or maybe your network can't transfer the data to the ES servers fast enough.
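To make the heap check above concrete, here is a minimal sketch of computing per-node heap usage from an Elasticsearch `GET _nodes/stats/jvm` response (fetched separately, e.g. with curl). The field names follow the `_nodes/stats` API (`jvm.mem.heap_used_in_bytes` / `heap_max_in_bytes`); the node name and byte values in the sample payload are made up for illustration.

```python
def heap_usage_percent(stats: dict) -> dict:
    """Return {node_name: heap usage %} from a _nodes/stats/jvm payload."""
    result = {}
    for node in stats["nodes"].values():
        mem = node["jvm"]["mem"]
        pct = 100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
        result[node["name"]] = round(pct, 1)
    return result

# Example with a trimmed-down payload (values are hypothetical):
sample = {
    "nodes": {
        "abc123": {
            "name": "es-node-1",
            "jvm": {"mem": {"heap_used_in_bytes": 24_500_000_000,
                            "heap_max_in_bytes": 30_000_000_000}},
        }
    }
}
print(heap_usage_percent(sample))
```

Sustained heap usage above roughly 80% is generally treated as a warning sign; see the Elastic documentation for the current recommendations.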


(Markuchi) #6

If you are sure that your servers have sufficient resources, we found that the following settings helped ensure the output could keep up:

ring_size
outputbuffer_processors
processbuffer_processors
output_batch_size

Make sure you read the documentation on these before adjusting them.
Setting the output batch size too large seems to cause some instability. Doubling the default was fine for us; going much higher gave much more throughput but caused instability with Graylog.

Also, we increased our journal size to handle cases where there was a prolonged influx of data or ES had to go down for a while.
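For reference, these settings live in Graylog's `server.conf`. The values below are only an illustrative sketch, not recommendations; the defaults shown in the comments are roughly those of Graylog 2.x, so check the documentation for your version before changing anything.

```
# Graylog server.conf -- illustrative values only, tune for your hardware.
# (Graylog 2.x defaults are approximately: ring_size = 65536,
#  processbuffer_processors = 5, outputbuffer_processors = 3,
#  output_batch_size = 500 -- verify against your version's docs.)
ring_size = 65536
processbuffer_processors = 5
outputbuffer_processors = 3
# Doubling the default batch size is a conservative first step;
# much larger values gave more throughput but caused instability.
output_batch_size = 1000
```

A Graylog restart is required for these to take effect.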


(Tafsir) #7

OK, will check, @macko003.

Heap usage is 22.8 GB - is this sufficient to handle the messages?

Thanks


(Tafsir) #8

Hi @markuchi,

Where can I find these settings?


(Tafsir) #9

@markuchi @macko003 @jan

The attached screenshot shows the backlog on my data node. The outgoing message rate is very slow.

When I looked at the server log, I found this:

2018-09-20T19:00:37.503Z ERROR [SystemJobManager] Unhandled error while running SystemJob <1414ccf0-bcff-11e8-a932-1244b83420c4> [org.graylog2.indexer.indices.jobs.OptimizeIndexJob]
org.graylog2.indexer.ElasticsearchException: Couldn't force merge index graylog_550
	at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:51) ~[graylog.jar:?]
	at org.graylog2.indexer.indices.Indices.optimizeIndex(Indices.java:629) ~[graylog.jar:?]
	at org.graylog2.indexer.indices.jobs.OptimizeIndexJob.execute(OptimizeIndexJob.java:71) ~[graylog.jar:?]
	at org.graylog2.system.jobs.SystemJobManager$1.run(SystemJobManager.java:89) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedScheduledExecutorService$InstrumentedRunnable.run(InstrumentedScheduledExecutorService.java:235) [graylog.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_141]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_141]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_141]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_141]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_141]
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_141]
	at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_141]
	at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_141]
	at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:137) ~[graylog.jar:?]
	at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:153) ~[graylog.jar:?]
	at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:282) ~[graylog.jar:?]
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138) ~[graylog.jar:?]
	at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56) ~[graylog.jar:?]
	at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:259) ~[graylog.jar:?]
	at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:163) ~[graylog.jar:?]
	at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165) ~[graylog.jar:?]
	at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:273) ~[graylog.jar:?]
	at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:125) ~[graylog.jar:?]
	at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[graylog.jar:?]
	at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[graylog.jar:?]
	at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111) ~[graylog.jar:?]
	at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[graylog.jar:?]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[graylog.jar:?]
	at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[graylog.jar:?]
	at io.searchbox.client.http.JestHttpClient.executeRequest(JestHttpClient.java:151) ~[graylog.jar:?]
	at io.searchbox.client.http.JestHttpClient.execute(JestHttpClient.java:77) ~[graylog.jar:?]
	at org.graylog2.indexer.cluster.jest.JestUtils.execute(JestUtils.java:46) ~[graylog.jar:?]
	... 11 more
2018-09-20T19:19:18.844Z WARN  [StreamRouterEngine] Unexpected error during stream matching
java.util.concurrent.TimeoutException: null
	at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_141]
	at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:147) ~[graylog.jar:?]
	at org.graylog2.streams.StreamRouterEngine$Rule.matchWithTimeOut(StreamRouterEngine.java:303) [graylog.jar:?]
	at org.graylog2.streams.StreamRouterEngine$Rule.access$000(StreamRouterEngine.java:264) [graylog.jar:?]
	at org.graylog2.streams.StreamRouterEngine.match(StreamRouterEngine.java:191) [graylog.jar:?]
	at org.graylog2.streams.StreamRouter.route(StreamRouter.java:99) [graylog.jar:?]
	at org.graylog2.filters.StreamMatcherFilter.filter(StreamMatcherFilter.java:42) [graylog.jar:?]
	at org.graylog2.messageprocessors.MessageFilterChainProcessor.process(MessageFilterChainProcessor.java:100) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.handleMessage(ProcessBufferProcessor.java:114) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.dispatchMessage(ProcessBufferProcessor.java:100) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:77) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
2018-09-20T19:19:18.845Z WARN  [StreamRouterEngine] Unexpected error during stream matching
java.util.concurrent.TimeoutException: null
	at java.util.concurrent.FutureTask.get(FutureTask.java:205) ~[?:1.8.0_141]
	at com.google.common.util.concurrent.SimpleTimeLimiter.callWithTimeout(SimpleTimeLimiter.java:147) ~[graylog.jar:?]
	at org.graylog2.streams.StreamRouterEngine$Rule.matchWithTimeOut(StreamRouterEngine.java:303) [graylog.jar:?]
	at org.graylog2.streams.StreamRouterEngine$Rule.access$000(StreamRouterEngine.java:264) [graylog.jar:?]
	at org.graylog2.streams.StreamRouterEngine.match(StreamRouterEngine.java:191) [graylog.jar:?]
	at org.graylog2.streams.StreamRouter.route(StreamRouter.java:99) [graylog.jar:?]
	at org.graylog2.filters.StreamMatcherFilter.filter(StreamMatcherFilter.java:42) [graylog.jar:?]
	at org.graylog2.messageprocessors.MessageFilterChainProcessor.process(MessageFilterChainProcessor.java:100) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.handleMessage(ProcessBufferProcessor.java:114) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.dispatchMessage(ProcessBufferProcessor.java:100) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:77) [graylog.jar:?]
	at org.graylog2.shared.buffers.processors.ProcessBufferProcessor.onEvent(ProcessBufferProcessor.java:42) [graylog.jar:?]
	at com.lmax.disruptor.WorkProcessor.run(WorkProcessor.java:143) [graylog.jar:?]
	at com.codahale.metrics.InstrumentedThreadFactory$InstrumentedRunnable.run(InstrumentedThreadFactory.java:66) [graylog.jar:?]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]


#10

@Tafsir_Alam
Check the heap usage in %; over 80% could be a problem, but check the Elastic recommendations.
The logs show stream processing timeouts, so check or temporarily disable your stream rules. When I used regexp-based stream rules, Graylog disabled the stream because it couldn't handle it in time.

What do you see when you click on the node name under System / Nodes?
Which buffer is full on your server?

If you try to play with the parameters @markuchi suggested, double-check the documentation, and only change them if you are sure what you are doing.
You will find these parameters in Graylog's server.conf.
If you increase output_batch_size, you need to increase the Elasticsearch bulk_max_body_size too.
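To sketch how these two settings pair up: larger `output_batch_size` values make Graylog send larger bulk requests, which must fit under Elasticsearch's HTTP request body limit. Assuming the "bulk_max_body_size" mentioned above refers to Elasticsearch's `http.max_content_length` setting (default 100mb), the two configs would look roughly like this; the batch size shown is purely illustrative.

```
# Graylog server.conf (illustrative value)
output_batch_size = 2000

# elasticsearch.yml -- the HTTP body limit that large bulk
# requests must fit under (the default is 100mb):
http.max_content_length: 100mb
```

Both services need a restart for these to apply.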


(Zero) #11

What is your infrastructure? Are you using dedicated Kafka? To me it is strange that one of your nodes is processing 6,000 messages while the two other nodes are doing nothing - that's not how it should look.

I had much the same situation with 3 Graylog nodes: one node processed 90% of the messages and the two others did almost nothing. The solution was quite simple: I had 3 Kafkas, and in Graylog => System => Inputs I had set 3 threads for an input, but that was a big mistake. For 3 Kafkas (3 replicas, 3 partitions) and 3 Graylogs I had to set 1 thread per input, and it works like a charm - every node does around 33% of the work :slight_smile:


#12

I installed a load balancer as a frontend, so it hands almost the same number of messages to each of my servers.


(Zero) #13

Try setting Processor Threads = 1 for every input in SYSTEM => INPUTS, More actions => Edit Input.
Also, restart the Graylog nodes afterwards.


(Tafsir) #14

@amitshar04 ^^ I hope the messages above will help us solve the backlog problem.


(system) #15

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.