Error in Graylog3.0 heap size reaching max limit and all messages are in Process buffer started getting socket time out exception

Hi All,

We are running graylog 3.0 in Azure VM and Elasticsearch 6.8.1 in another separate Azure VM. The compute of Graylog & Elasticsearch VM has 4VCPUs, 16GB RAM, 6400 MAX IOPS

We have allocated graylog node heap space 8GB and elasticsearch node 8GB.

Now we started processing IIS logs to our environment via filebeat to beats port (5044) and data is coming to stream and below is the sample IIS message,

2019-07-29 10:02:54 W3SVC1240477980 RD00155DF24AE6 192.168.151.22 POST /Common/GetUnReadEmailTransactionByUser - 443 - 203.90.4.235 HTTP/2.0 Mozilla/5.0+(Windows+NT+10.0;+Win64;+x64)+AppleWebKit/537.36+(KHTML,+like+Gecko)+Chrome/75.0.3770.142+Safari/537.36 _ga=GA1.2.447190142.1534840655;+intercom-id-w7xf88vf=31273993-925e-416f-928d-cc56564fb546;+intercom-id-k1ocnq4d=03fdbaa2-39d6-4378-b54f-a6f02bec51df;+_fbp=fb.1.1558210096632.942788230;+hubspotutk=a70a5af60db08860f01ca8db086e7f85;+__hstc=47073656.a70a5af60db08860f01ca8db086e7f85.1558210098490.1558210098490.1560748468343.2;+_gid=GA1.2.112440141.1564301802;+loginUrl=mlrmc;+ASP.NET_SessionId=o3ouwe2m5f3o3g4hn5tl4nbn https://graylog.com/ChangeLog/Index graylog.com 200 0 0 688 932 78

Since the message is not JSON strutured so we have written grok pattern and added in graylog,

%{TIMESTAMP_ISO8601:iis.access.time} %{NOTSPACE:iis.access.hostname} %{NOTSPACE:iis.access.server_name} %{IPORHOST:iis.access.server_ip} %{WORD:iis.access.method} %{URIPATH:iis.access.url} %{NOTSPACE:iis.access.query_string} %{NUMBER:iis.access.port} %{NOTSPACE:iis.access.user_name} %{IPORHOST:iis.access.remote_ip} HTTP/%{NUMBER:iis.access.http_version} %{NOTSPACE:iis.access.useragent} %{NOTSPACE:iis.access.cookie} %{NOTSPACE:iis.access.referrer} %{NOTSPACE:iis.access.site_name} %{NUMBER:iis.access.response_code} %{NUMBER:iis.access.sub_status} %{NUMBER:iis.access.win32_status} %{NUMBER:iis.access.body_sent.bytes} %{NUMBER:iis.access.body_received.bytes} %{NUMBER:iis.access.request_time_ms}

In graylog we have created new pipeline and rules of it. we then configured filebeat stream to the pipeline and we started processing message and we got around avg 800 messages per minute to graylog. But within 10 minutes graylog becomes unstable and heap usage reached to maximum limit and all the messages were going to process buffer and no messages were processed to elasticsearch, now all the messages are in graylog journal.

Below is the response in server.log

ubuntu@graylog3:~$ sudo tail -f /var/log/graylog-server/server.log
2019-07-29T06:53:43.696Z INFO  [NetworkListener] Started listener bound to [0.0.0.0:9000]
2019-07-29T06:53:43.698Z INFO  [HttpServer] [HttpServer] Started.
2019-07-29T06:53:43.698Z INFO  [JerseyService] Started REST API at <0.0.0.0:9000>
2019-07-29T06:53:43.698Z INFO  [ServiceManagerListener] Services are healthy
2019-07-29T06:53:43.699Z INFO  [InputSetupService] Triggering launching persisted inputs, node transitioned from Uninitialized?[LB:DEAD] to Running?[LB:ALIVE]
2019-07-29T06:53:43.699Z INFO  [ServerBootstrap] Services started, startup times in ms: {GracefulShutdownService [RUNNING]=25, OutputSetupService [RUNNING]=172, BufferSynchronizerService [RUNNING]=180, KafkaJournal [RUNNING]=183, EtagService [RUNNING]=212, ConfigurationEtagService [RUNNING]=253, JournalReader [RUNNING]=260, LookupTableService [RUNNING]=260, InputSetupService [RUNNING]=263, StreamCacheService [RUNNING]=343, PeriodicalsService [RUNNING]=454, JerseyService [RUNNING]=21428}
2019-07-29T06:53:43.703Z INFO  [ServerBootstrap] Graylog server up and running.
2019-07-29T06:53:43.717Z INFO  [InputStateListener] Input [Beats/5d3d7fe5376cc71ed63ca6df] is now STARTING
2019-07-29T06:53:43.787Z INFO  [InputStateListener] Input [Beats/5d3d7fe5376cc71ed63ca6df] is now RUNNING
2019-07-29T06:53:43.796Z WARN  [AbstractTcpTransport] receiveBufferSize (SO_RCVBUF) for input Beats2Input{title=beats, type=org.graylog.plugins.beats.Beats2Input, nodeId=null} (channel [id: 0x0e6006ed, L:/0:0:0:0:0:0:0:0%0:5045]) should be 1048576 but is 425984.

2019-07-29T07:57:01.676Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0x6b291f6d, L:/172.16.2.6:5045 ! R:/52.163.89.113:11400]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T07:57:01.982Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0xaeb83439, L:/172.16.2.6:5045 ! R:/137.116.131.234:62272]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T07:57:02.059Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0xf0db807c, L:/172.16.2.6:5045 ! R:/137.116.131.234:63496]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T07:57:05.168Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0x89bc7c55, L:/172.16.2.6:5045 ! R:/13.76.211.33:21096]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T08:13:40.351Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0xebe6e5ac, L:/172.16.2.6:5045 ! R:/137.116.131.234:62273]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T08:13:44.796Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0x83029cd9, L:/172.16.2.6:5045 ! R:/13.76.211.33:21096]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T08:13:46.403Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0x348fd719, L:/172.16.2.6:5045 ! R:/52.163.89.113:11488]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T08:13:50.773Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0x04c460e4, L:/172.16.2.6:5045 ! R:/52.163.89.113:11384]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T08:13:50.979Z ERROR [AbstractTcpTransport] Error in Input [Beats/5d3d7fe5376cc71ed63ca6df] (channel [id: 0xb5059313, L:/172.16.2.6:5045 ! R:/137.116.131.234:63497]) (cause io.netty.channel.unix.Errors$NativeIoException: syscall:read(..) failed: Connection reset by peer)
2019-07-29T09:33:00.230Z INFO  [connection] Opened connection [connectionId{localValue:11, serverValue:134}] to 192.16.2.6:27017
2019-07-29T09:33:00.231Z INFO  [connection] Opened connection [connectionId{localValue:10, serverValue:136}] to 192.16.2.6:27017
2019-07-29T09:33:00.231Z INFO  [connection] Opened connection [connectionId{localValue:9, serverValue:135}] to 192.16.2.6:27017
2019-07-29T09:33:21.741Z INFO  [connection] Opened connection [connectionId{localValue:12, serverValue:137}] to 192.16.2.6:27017
2019-07-29T09:35:34.130Z INFO  [connection] Opened connection [connectionId{localValue:13, serverValue:138}] to 192.16.2.6:27017
2019-07-29T09:35:34.137Z INFO  [connection] Opened connection [connectionId{localValue:14, serverValue:139}] to 192.16.2.6:27017
2019-07-29T09:37:45.738Z INFO  [connection] Opened connection [connectionId{localValue:16, serverValue:140}] to 192.16.2.6:27017
2019-07-29T09:37:45.739Z INFO  [connection] Opened connection [connectionId{localValue:15, serverValue:141}] to 192.16.2.6:27017
2019-07-29T09:39:07.683Z INFO  [connection] Opened connection [connectionId{localValue:18, serverValue:142}] to 192.16.2.6:27017
2019-07-29T09:39:07.713Z INFO  [connection] Opened connection [connectionId{localValue:17, serverValue:143}] to 192.16.2.6:27017
2019-07-29T09:44:18.107Z ERROR [AnyExceptionClassMapper] Unhandled exception in REST resource
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_212]
        at okio.Okio$2.read(Okio.java:140) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355) ~[graylog.jar:?]
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189) ~[graylog.jar:?]
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:61) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) ~[graylog.jar:?]
        at okhttp3.RealCall.execute(RealCall.java:77) ~[graylog.jar:?]
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
        at org.graylog2.rest.resources.cluster.ClusterNodeMetricsResource.byNamespace(ClusterNodeMetricsResource.java:122) ~[graylog.jar:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_212]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_212]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_212]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_212]
        at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$TypeOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:205) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102) ~[graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:315) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:297) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:267) [graylog.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) [graylog.jar:?]
        at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) [graylog.jar:?]
        at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:384) [graylog.jar:?]
        at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:224) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181) [graylog.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
^[[5~2019-07-29T09:49:12.655Z WARN  [ProxiedResource] Unable to call http://graylog.southeastasia.cloudapp.azure.com:9000/api/system on node <7bd8596d-a8ab-406e-abca-0715d20b8f70>
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_212]
        at okio.Okio$2.read(Okio.java:140) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355) ~[graylog.jar:?]
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189) ~[graylog.jar:?]
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:61) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) ~[graylog.jar:?]
        at okhttp3.RealCall.execute(RealCall.java:77) ~[graylog.jar:?]
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
        at org.graylog2.shared.rest.resources.ProxiedResource.lambda$getForAllNodes$0(ProxiedResource.java:78) ~[graylog.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
2019-07-29T09:51:07.236Z WARN  [NodePingThread] Did not find meta info of this node. Re-registering.

Though we have checked in Elastic logs but we couldn’t find any exceptions.

Below is the graylog server.conf,

############################
# GRAYLOG CONFIGURATION FILE
############################
is_master = true
node_id_file = /etc/graylog/server/node-id
password_secret = SEtby4wg3kBPBwkYgqcxRlfCrRf8NbmWSLFvrXqsJ6SorC1WGwFWehPS2cQ1kjSc6Ea0cbhO080TSW
root_username = admin
root_password_sha2 = 00f86d4e5f5e3265f8546589dbf3a8834dce95e5d54388b5b458cfae
root_timezone = UTC
bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin
http_bind_address = 0.0.0.0:9000
http_publish_uri = http://graylog.southeastasia.cloudapp.azure.com:9000/
http_external_uri = http://graylog.southeastasia.cloudapp.azure.com:9000/
http_enable_cors = true
http_enable_gzip = true
http_max_header_size = 8192
http_thread_pool_size = 16
elasticsearch_hosts = http://190.32.3.4:9201
elasticsearch_connect_timeout = 10s
elasticsearch_socket_timeout = 60s
elasticsearch_max_total_connections = 20
elasticsearch_max_total_connections_per_route = 2
elasticsearch_max_retries = 2
rotation_strategy = count
elasticsearch_max_docs_per_index = 20000000
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 4
elasticsearch_replicas = 0
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 5000
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 8
outputbuffer_processors = 4
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 2
inputbuffer_wait_strategy = blocking
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
lb_recognition_period_seconds = 3
mongodb_uri = mongodb://190.32.3.4:27017/graylog
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
proxied_requests_thread_pool_size = 32

Note:-
We have done the same setup in another Azure VM where we have installed Graylog 2.5 and processing the same IIS logs without any issues and we didn’t change any graylog config and we used the default config.

Please share your thoughts and it would be very helpful.

Regards,
Ganeshbabu R

dots in fieldnames are not possible. change your grok pattern and it should run.

@jan

I did tried removing the dots in the fieldnames and changed the grok pattern with Camelcase, now I have been closely monitoring the graylog and in last one hour the heap usage of the node looks stable by then messages are coming to streams.

But after certain time the message are started going to process buffer itself and message are not processing to elastic

I am processing only IIS access log via filebeat as in the last one hour below is the number of message came to stream,

Once the heap space max limit is reached I am started getting this messages in graylog server.log

2019-07-29T19:24:31.148Z INFO  [connection] Opened connection [connectionId{localValue:11, serverValue:180}] to 192.16.2.6:27017
2019-07-29T19:26:09.567Z INFO  [connection] Opened connection [connectionId{localValue:13, serverValue:182}] to 192.16.2.6:27017
2019-07-29T19:26:09.568Z INFO  [connection] Opened connection [connectionId{localValue:15, serverValue:183}] to 192.16.2.6:27017
2019-07-29T19:26:09.569Z INFO  [connection] Opened connection [connectionId{localValue:12, serverValue:181}] to 192.16.2.6:27017
2019-07-29T19:26:09.570Z INFO  [connection] Opened connection [connectionId{localValue:16, serverValue:185}] to 192.16.2.6:27017
2019-07-29T19:26:09.570Z INFO  [connection] Opened connection [connectionId{localValue:17, serverValue:186}] to 192.16.2.6:27017
2019-07-29T19:26:09.572Z INFO  [connection] Opened connection [connectionId{localValue:14, serverValue:184}] to 192.16.2.6:27017
2019-07-29T19:40:53.028Z WARN  [ProxiedResource] Unable to call http://graylog.southeastasia.cloudapp.azure.com:9000/api/system on node <7bd8596d-a8ab-406e-abca-0715d20b8f70>
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_212]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_212]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_212]
        at okio.Okio$2.read(Okio.java:140) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:355) ~[graylog.jar:?]
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:227) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readHeaderLine(Http1Codec.java:215) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189) ~[graylog.jar:?]
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:61) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121) ~[graylog.jar:?]
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:200) ~[graylog.jar:?]
        at okhttp3.RealCall.execute(RealCall.java:77) ~[graylog.jar:?]
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
        at org.graylog2.shared.rest.resources.ProxiedResource.lambda$getForAllNodes$0(ProxiedResource.java:78) ~[graylog.jar:?]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]
2019-07-29T19:41:15.190Z WARN  [NodePingThread] Did not find meta info of this node. Re-registering.

Should I need to make changes in processbuffer parameter in graylog server.conf?

Not able to understand what could be the reason for this behaviour in graylog and when I executed the top command inside the graylog sever and I could see the %CPU is reached 100 for graylog process id

image

Do you think graylog doesn’t have enough cpu cores to process the messages?

Note:-
But we tried the same grok pattern with dots in fieldnames in graylog 2.5 and in that environment I could see messages were processing without any errors.

Please correct me if my understanding is wrong and share your thoughts.

Thanks,
Ganeshbabu R

you (and we) could guess what is going on - or you have a system where you store the internal Graylog metrics to display trends over time and you can spot at what point Graylog has issues.

The position I’m in now is where I’ll jump out of this now - as I do not have the time to dig that deep into a system without getting anything back (reads - for me would now start the payed work).

We are seeing similair issues on our graylog multi-cluser. Java connections fail, graylog not putting any messages out to elasticsearch.
Tried settings the /etc/hosts file, no solution.
Now trying: stale_master_timeout set to 2000 (Default value?) since date shows all servers having same date + time to the second…

Add more CPUs. you have processbuffer_processors set to 8, outputbuffer_processors set to 4, but you only have 4 CPUs.

Not sure what your ingest volume is, so can’t make recommendations on that, but if you only have 4 CPUs, you can’t allocate 12.

@cawfehman

Yes true graylog is not able to process any messages due to unavailability of CPU resources… we increased compute to both Graylog and Elastic now messages are processing in good state…

But we also found the reason for graylog nodes crashes down but not sure the exact reason why this behaviour is happening in graylog.

Let me know your thoughts…

Thanks,
Ganeshbabu R

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.