Graylog - Uncommited messages deleted from journal & utilization is too high


(Ganeshbabu Ramamoorthy) #1

Hi All,

We installed Graylog 2.3 & Elasticsearch 5.5.2 on an Ubuntu 16.04 machine, and I am getting the notifications below in Graylog:

Deflector exists as an index and is not an alias. (triggered 20 hours ago)
The deflector is meant to be an alias but exists as an index. Multiple failures of infrastructure can lead to this. Your messages are still indexed but searches and all maintenance tasks will fail or produce incorrect results. It is strongly recommend that you act as soon as possible.
×
 Uncommited messages deleted from journal (triggered 21 hours ago)
Some messages were deleted from the Graylog journal before they could be written to Elasticsearch. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit. (Node: 291ee918-b16c-ca1752)
×
 Journal utilization is too high (triggered a day ago)
Journal utilization is too high and may go over the limit soon. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit. (Node: 291ee918-b16c-ca1752)

There are no error or warning messages in either the Graylog or Elasticsearch logs.

Elasticsearch cluster health

root@Graylog:/etc/graylog# curl -XGET http://Graylog:9200/_cluster/health?pretty=true
{
  "cluster_name" : "*Graylog*",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 36,
  "active_shards" : 36,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Please advise how to resolve this.

Regards,
Ganeshbabu R


(Jan Doberstein) #2

You need to remove those messages by clicking the × in the top right.

But you should then investigate why those messages were triggered.


(Ganeshbabu Ramamoorthy) #3

@jan

But you should then investigate why those messages are triggered.

I wasn’t sure how to investigate these messages, but after removing them from the notifications I found this warning in the server.log:

2017-11-15T08:37:15.619Z WARN  [ProxiedResource] Unable to call http://graylogserver.southeastasia.cloudapp.azure.com:9000/api/system/metrics/multiple on node <291ee918-b16c-4321-b6f8-7a88a3ca1752>
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method) ~[?:1.8.0_144]
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116) ~[?:1.8.0_144]
        at java.net.SocketInputStream.read(SocketInputStream.java:171) ~[?:1.8.0_144]
        at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[?:1.8.0_144]
        at okio.Okio$2.read(Okio.java:139) ~[graylog.jar:?]
        at okio.AsyncTimeout$2.read(AsyncTimeout.java:237) ~[graylog.jar:?]
        at okio.RealBufferedSource.indexOf(RealBufferedSource.java:345) ~[graylog.jar:?]
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:217) ~[graylog.jar:?]
        at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.java:211) ~[graylog.jar:?]
        at okhttp3.internal.http1.Http1Codec.readResponseHeaders(Http1Codec.java:189) ~[graylog.jar:?]
        at okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:75) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at org.graylog2.rest.RemoteInterfaceProvider.lambda$get$0(RemoteInterfaceProvider.java:59) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92) ~[graylog.jar:?]
        at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67) ~[graylog.jar:?]
        at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185) ~[graylog.jar:?]
        at okhttp3.RealCall.execute(RealCall.java:69) ~[graylog.jar:?]
        at retrofit2.OkHttpCall.execute(OkHttpCall.java:180) ~[graylog.jar:?]
        at org.graylog2.shared.rest.resources.ProxiedResource.lambda$getForAllNodes$0(ProxiedResource.java:76) ~[graylog.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]
2017-11-15T08:37:57.150Z WARN  [NodePingThread] Did not find meta info of this node. Re-registering.
2017-11-15T08:41:28.765Z WARN  [DefaultFilterChain] GRIZZLY0013: Exception during FilterChain execution
java.lang.OutOfMemoryError: Java heap space
2017-11-15T08:41:28.766Z ERROR [SelectorRunner] doSelect exception
java.lang.OutOfMemoryError: Java heap space
        at sun.nio.ch.Net.localInetAddress(Native Method) ~[?:1.8.0_144]
        at sun.nio.ch.Net.localAddress(Net.java:479) ~[?:1.8.0_144]
        at sun.nio.ch.SocketChannelImpl.<init>(SocketChannelImpl.java:133) ~[?:1.8.0_144]
        at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:266) ~[?:1.8.0_144]
        at org.glassfish.grizzly.nio.transport.TCPNIOServerConnection.doAccept(TCPNIOServerConnection.java:169) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.transport.TCPNIOServerConnection.onAccept(TCPNIOServerConnection.java:233) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.transport.TCPNIOTransport.fireIOEvent(TCPNIOTransport.java:508) ~[graylog.jar:?]
        at org.glassfish.grizzly.strategies.AbstractIOStrategy.fireIOEvent(AbstractIOStrategy.java:112) ~[graylog.jar:?]
        at org.glassfish.grizzly.strategies.SameThreadIOStrategy.executeIoEvent(SameThreadIOStrategy.java:103) ~[graylog.jar:?]
        at org.glassfish.grizzly.strategies.AbstractIOStrategy.executeIoEvent(AbstractIOStrategy.java:89) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.SelectorRunner.iterateKeyEvents(SelectorRunner.java:415) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.SelectorRunner.iterateKeys(SelectorRunner.java:384) ~[graylog.jar:?]
        at org.glassfish.grizzly.nio.SelectorRunner.doSelect(SelectorRunner.java:348) [graylog.jar:?]
        at org.glassfish.grizzly.nio.SelectorRunner.run(SelectorRunner.java:279) [graylog.jar:?]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:593) [graylog.jar:?]
        at org.glassfish.grizzly.threadpool.AbstractThreadPool$Worker.run(AbstractThreadPool.java:573) [graylog.jar:?]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144]

you need to remove those message by click on the x in the top right.

I removed those messages from the notifications and tried sending some data to Graylog again, but Filebeat is showing the error below:

2017-11-15T09:16:33Z INFO Harvester started for file: /etc/graylog/graylog.csv
2017-11-15T09:16:52Z INFO Non-zero metrics in the last 30s: filebeat.harvester.running=1 filebeat.harvester.started=1 filebeat.harvester.open_files=1 libbeat.logstash.call_count.PublishEvents=1 libbeat.logstash.publish.write_bytes=3149 libbeat.publisher.published_events=1383
2017-11-15T09:17:03Z ERR Failed to publish events caused by: read tcp 12.0.0.5:52274->42.234.120.00:5044: i/o timeout
2017-11-15T09:17:03Z INFO Error publishing events (retrying): read tcp 12.0.0.5:52274->42.234.120.00:5044: i/o timeout
2017-11-15T09:17:22Z INFO Non-zero metrics in the last 30s: publish.events=10002 registar.states.current=1 libbeat.logstash.published_and_acked_events=6747 libbeat.logstash.published_but_not_acked_events=1383 libbeat.publisher.published_events=5364 libbeat.logstash.call_count.PublishEvents=5 libbeat.logstash.publish.read_errors=1 libbeat.logstash.publish.read_bytes=4062 libbeat.logstash.publish.write_bytes=2684556 registrar.states.update=10002 registrar.writes=5

Let me know if you need any further information.

Thanks,
Ganeshbabu R


(Jan Doberstein) #4

@Ganeshbabu

you mix everything together and then ask a question out of that mix …


1. Error messages:

Are the error messages returning after you had removed them (by click on the x) ?


2. Collector Sidecar Delivery of Logs:

From the small log snippet and the limited information provided, I would suggest the following:

 12.0.0.5:52274->42.234.120.00:5044
  • Check the connection between the server where Collector-Sidecar/Filebeat is running and Graylog.
  • Check that you have a Beats input running on Graylog on port 5044.
  • Check that the server where Collector-Sidecar/Filebeat is running can connect to port 5044 of the Graylog server.
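The last two checks can be done from the Filebeat host with a quick TCP probe. This is a minimal sketch assuming bash (for its `/dev/tcp` pseudo-device); the host is a placeholder you should replace with your Graylog server:

```shell
#!/usr/bin/env bash
# Probe TCP reachability of the Graylog Beats input from the Filebeat host.
# GRAYLOG_HOST is a placeholder -- substitute your own server address.
check_port() {
  # bash's /dev/tcp pseudo-device: succeeds only if the TCP connect succeeds
  timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

GRAYLOG_HOST=127.0.0.1   # placeholder
BEATS_PORT=5044

if check_port "$GRAYLOG_HOST" "$BEATS_PORT"; then
  echo "Beats input on ${GRAYLOG_HOST}:${BEATS_PORT} is reachable"
else
  echo "Cannot connect to ${GRAYLOG_HOST}:${BEATS_PORT} - check the input and any firewall rules"
fi
```

If the probe fails from the Filebeat host but succeeds on the Graylog server itself, the problem is usually a firewall or (on Azure, as here) a network security group rule rather than Graylog.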

(Ganeshbabu Ramamoorthy) #5

you mix everything together and then ask a question out of that mix …

@jan
My intention was not to cause any confusion; I was trying to send some data to Graylog through Filebeat to check whether I would get the same error messages or not.

Are the error messages returning after you had removed them (by click on the x) ?

No, the error messages have not returned since I removed them.

Deflector exists as an index and is not an alias. (triggered 20 hours ago)
The deflector is meant to be an alias but exists as an index. Multiple failures of infrastructure can lead to this. Your messages are still indexed but searches and all maintenance tasks will fail or produce incorrect results. It is strongly recommend that you act as soon as possible.
×
 Uncommited messages deleted from journal (triggered 21 hours ago)
Some messages were deleted from the Graylog journal before they could be written to Elasticsearch. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit. (Node: 291ee918-b16c-ca1752)
×
 Journal utilization is too high (triggered a day ago)
Journal utilization is too high and may go over the limit soon. Please verify that your Elasticsearch cluster is healthy and fast enough. You may also want to review your Graylog journal settings and set a higher limit. (Node: 291ee918-b16c-ca1752)

I don’t know why these error messages were thrown; if you could share the root cause or any documentation for reference, it would be very helpful.

Regards,
Ganeshbabu R


(Jan Doberstein) #6

Sorry that we missed that in our documentation, but thank you for pointing it out - I’ll add a section about it to the FAQ ASAP.

But let me explain first. The journal settings live in the Graylog server.conf.

Taking the defaults as an example: Graylog starts dropping messages from the journal once they are older than 12 hours, or once the journal grows above 5 GB.
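For reference, the relevant settings in server.conf look roughly like this (values shown are the Graylog 2.x defaults; the path may differ on your install):

```
# Journal settings in /etc/graylog/server/server.conf (2.x defaults)
message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 12h
message_journal_max_size = 5gb
```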

What you can do:

  • First check the journal on your Graylog nodes in the Graylog UI (System > Nodes)
    • check whether the journal is still full
  • Raise the resources of your Elasticsearch cluster
    • analyze where the bottlenecks are
    • adjust the Elasticsearch connection settings in Graylog
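You can also check the journal state without the UI by querying the node's REST API. This is a sketch: the credentials and host are placeholders, and the `/system/journal` endpoint is the one exposed by the Graylog 2.x API:

```shell
# Query the node's journal status via the REST API; falls back to a
# message if the API is unreachable. admin:password and 127.0.0.1:9000
# are placeholders -- substitute your node's address and credentials.
out=$(curl -s --max-time 5 -u admin:password \
  "http://127.0.0.1:9000/api/system/journal?pretty=true" \
  || echo "Graylog API not reachable")
echo "$out"
```

The response reports the number of unflushed (uncommitted) entries and the journal size, which tells you whether the journal is still backed up behind Elasticsearch.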

(Ganeshbabu Ramamoorthy) #7

Thanks @jan for sharing this information; now I have a clear path forward. I will check the journal on my Graylog node and try troubleshooting the errors.

I will post the update soon.

Regards,
Ganeshbabu R


(system) #8

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.