Graylog Throughput / Metrics problem


(Zero) #1

Hello!
So I have a strange case where everything is “status green”, yet Graylog randomly stops ingesting new messages from one specific input.
We have a cluster of 3 Kafka brokers, and every topic is set to 3 partitions with a replication factor of 1, but this one input has problems. I’ve discovered that under the input’s “Show details”, only one Graylog node has any I/O for that input; the rest show

Network IO: 0B

Every other input works like a charm, and they all follow exactly the same idea: a Kafka topic with 3 partitions and a replication factor of 1. I wouldn’t mind if only one Graylog node ingested from that one input, but Graylog seems to randomly stop ingesting from it AT ALL, and that’s the problem. Logstash logs no problems, and neither does Kafka.
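Why some nodes can show 0B of I/O even when nothing is broken can be illustrated with a minimal sketch of range-style partition assignment, which the old ZooKeeper-based Kafka consumer is assumed to use here. The consumer ids, the `<node>-<thread>` naming, and the idea that each node runs two consumer threads for the input are all hypothetical; the real ids contain host names and UUIDs, so which node ends up idle is effectively arbitrary.

```python
def range_assign(partitions, consumer_ids):
    """Range-style assignment: sort consumer ids, then give each a contiguous block
    of partitions; the first len(partitions) % len(consumers) consumers get one extra."""
    consumers = sorted(consumer_ids)
    per_consumer, extra = divmod(len(partitions), len(consumers))
    assignment = {}
    start = 0
    for i, consumer in enumerate(consumers):
        count = per_consumer + (1 if i < extra else 0)
        assignment[consumer] = list(partitions[start:start + count])
        start += count
    return assignment


def partitions_per_node(assignment):
    """Group the assignment by node, assuming hypothetical ids of the form '<node>-<thread>'."""
    nodes = {}
    for consumer, parts in assignment.items():
        node = consumer.rsplit("-", 1)[0]
        nodes.setdefault(node, []).extend(parts)
    return nodes


# One consumer thread per node: every node gets exactly one partition.
one_thread = range_assign([0, 1, 2], ["node-a-0", "node-b-0", "node-c-0"])
print(partitions_per_node(one_thread))   # {'node-a': [0], 'node-b': [1], 'node-c': [2]}

# Two threads per node (hypothetical): six consumers compete for three partitions.
two_threads = range_assign([0, 1, 2], [f"node-{n}-{t}" for n in "abc" for t in (0, 1)])
print(partitions_per_node(two_threads))  # {'node-a': [0, 1], 'node-b': [2], 'node-c': []}
```

With more consumer threads than partitions, some threads (and possibly whole nodes) get nothing, so an idle node on its own is not necessarily a fault; a node that owns a partition and still shows 0B is the more suspicious case.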
Please help :smiley:


(Jan Doberstein) #2

So you have multiple inputs that are configured the same way for different Kafka topics, which are also set up identically. One of them is not working as it should. What is the difference between them?

  • Triple-check that the configuration is the same.
  • Check whether you see the same log entries when you start the inputs.
  • What is in your log files when nothing is being ingested?
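For the first check, comparing two input configurations mechanically is less error-prone than eyeballing them. The sketch below assumes you have already loaded each input’s settings into a dictionary (how you export them is up to you); the key names and values are illustrative only, not Graylog’s actual field names.

```python
def config_diff(a, b, ignore=("title", "topic")):
    """Return {key: (value_in_a, value_in_b)} for every setting that differs,
    skipping keys that are expected to differ between inputs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if k not in ignore and a.get(k) != b.get(k)}


# Hypothetical settings of a working and a misbehaving input; key names are illustrative.
working = {"title": "app-logs", "topic": "app_topic", "threads": 2, "fetch_min_bytes": 5, "offset_reset": "largest"}
broken = {"title": "xxxx-logs", "topic": "xxxx_topic", "threads": 2, "fetch_min_bytes": 5, "offset_reset": "smallest"}

print(config_diff(working, broken))  # {'offset_reset': ('largest', 'smallest')}
```

An empty result means the inputs really are identical apart from the ignored keys, which points the search toward the data or the Kafka side instead.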

(Zero) #3

Yes, we have multiple topics on a 3-node Kafka cluster; every topic has 3 partitions and a replication factor of 1.
On the Graylog side every input looks the same; only the topic names differ, and the input names of course.

We have 3 Graylog nodes, so there is a 1:1 ratio between Kafka and Graylog nodes.

This situation occurs only on this one input: the consumer LAG on the Kafka offsets slowly increases, and one (sometimes two) Graylog consumers do not “consume” any messages. This is quite strange, because all partitions contain the very same log messages, so it is a very abstract issue: why does one node consume something and the other doesn’t?
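A slowly growing lag can be narrowed down to specific partitions by comparing two snapshots of the consumer group’s offsets, for example taken from the output of Kafka’s `kafka-consumer-groups.sh --describe` a few minutes apart. The snapshot layout and all offsets below are hypothetical; the idea is simply that lag is the log-end offset minus the committed offset, and a partition whose committed offset stays put while its lag grows is the one nobody is consuming.

```python
def partition_lag(sample):
    """Lag per partition: log-end offset minus the group's committed offset."""
    return {p: end - sample["committed"].get(p, 0) for p, end in sample["log_end"].items()}


def stalled_partitions(earlier, later):
    """Partitions whose committed offset did not move while their lag grew."""
    lag_before, lag_after = partition_lag(earlier), partition_lag(later)
    return sorted(
        p for p in later["log_end"]
        if later["committed"].get(p, 0) == earlier["committed"].get(p, 0)
        and lag_after[p] > lag_before.get(p, 0)
    )


# Two hypothetical snapshots of the same consumer group, taken a few minutes apart.
t0 = {"log_end": {0: 5000, 1: 5100, 2: 5050}, "committed": {0: 4990, 1: 5095, 2: 4000}}
t1 = {"log_end": {0: 6000, 1: 6120, 2: 6040}, "committed": {0: 5998, 1: 6110, 2: 4000}}

print(partition_lag(t1))           # {0: 2, 1: 10, 2: 2040}
print(stalled_partitions(t0, t1))  # [2]
```

Knowing the stalled partition tells you which Graylog node owns it, which is the node whose logs are worth reading first.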

The Graylog logs on the nodes that are idle for this one input say things like “Message too big for the consumer”, even though we are running on the defaults, and by default the maximum message size is 1MB. And again: all partitions contain the same logs, and all Graylog nodes have the same config because they are a cluster, so it is really strange that one node does something and the other doesn’t :confused:
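One way identical configuration can still produce one failing node: each Kafka message is written to exactly one partition, so only the consumer that owns that partition ever tries to fetch it. The simulation below assumes a consumer cannot skip a message larger than its fetch size; the message sizes, node names, and 1:1 partition ownership are hypothetical.

```python
ONE_MB = 1024 * 1024  # assumed consumer fetch size, identical on every node


def consume(message_sizes, offset, fetch_size=ONE_MB):
    """Read a partition from `offset`, stopping at the first message that cannot fit in one fetch.

    Returns (next_offset, error). The offset never moves past an oversized message,
    so every retry from that offset fails at the same place.
    """
    while offset < len(message_sizes):
        if message_sizes[offset] > fetch_size:
            return offset, f"message at offset {offset} ({message_sizes[offset]} bytes) > fetch size {fetch_size}"
        offset += 1
    return offset, None


# Hypothetical message sizes (bytes) per partition; each message lives in exactly one partition.
partitions = {0: [500, 2000, 800], 1: [700, 900], 2: [600, 2 * ONE_MB, 400]}
owner = {0: "graylog-1", 1: "graylog-2", 2: "graylog-3"}  # hypothetical 1:1 ownership

for p, sizes in partitions.items():
    next_offset, error = consume(sizes, 0)
    print(owner[p], p, next_offset, error or "ok")
```

Here `graylog-1` and `graylog-2` read their partitions to the end, while `graylog-3` stops at offset 1 of partition 2 and stays there, even though all three share the same fetch size.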


(Zero) #4

Kafka logs this message when I click “Start Input” in Graylog for the problematic input:

[2019-01-07 10:49:52,265] INFO Got user-level KeeperException when processing sessionid:0x6XXXX62c008c type:create cxid:0x3 zxid:XXXX9bc9 txntype:-1 reqpath:n/a Error Path:/consumers/graylog2 Error:KeeperErrorCode = NodeExists for /consumers/graylog2 (org.apache.zookeeper.server.PrepRequestProcessor)
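The ZooKeeper line above is logged at INFO when a client asks ZooKeeper to create a node that already exists, here `/consumers/graylog2`. Kafka’s ZooKeeper-based consumers register under that path, and a create-if-absent helper treats NodeExists as success, so on its own the line most likely just means another consumer in the `graylog2` group registered first. The toy model below is an assumption-laden illustration of that pattern; `InMemoryZooKeeper` and `ensure_path` are hypothetical and are not the ZooKeeper or Kafka client APIs.

```python
class NodeExistsError(Exception):
    """Stand-in for ZooKeeper's NodeExists error code."""


class InMemoryZooKeeper:
    """Toy model of ZooKeeper create(): it refuses to create a path that already exists."""

    def __init__(self):
        self.paths = set()

    def create(self, path):
        if path in self.paths:
            raise NodeExistsError(path)
        self.paths.add(path)


def ensure_path(zk, path):
    """Create-if-absent: an existing node is treated as success, not as a failure."""
    try:
        zk.create(path)
        return "created"
    except NodeExistsError:
        return "already existed"


zk = InMemoryZooKeeper()
print(ensure_path(zk, "/consumers/graylog2"))  # created (first consumer to register)
print(ensure_path(zk, "/consumers/graylog2"))  # already existed (every later consumer)
```

The server side sees the second call as a rejected create and logs it, while the client carries on normally.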


(Jan Doberstein) #5

Message too big for the consumer

Did you check the messages? Maybe there is one big message in the queue?
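A quick way to check for big messages in a sample dumped from the topic is to measure each payload in bytes rather than characters. The sketch assumes you already have `(offset, payload)` pairs as strings; the sample data and the 1 MB limit are hypothetical. Payload size is only a lower bound, because Kafka also stores the key and some framing overhead per message.

```python
def oversized(records, limit):
    """Return (offset, size_in_bytes) for every record whose UTF-8 payload exceeds `limit`,
    largest first. The payload size is a lower bound on what Kafka actually stores."""
    sizes = ((offset, len(payload.encode("utf-8"))) for offset, payload in records)
    return sorted((item for item in sizes if item[1] > limit), key=lambda item: item[1], reverse=True)


LIMIT = 1024 * 1024  # assumed 1 MB limit
sample = [  # hypothetical (offset, payload) pairs dumped from the topic
    (100, "short log line"),
    (101, "x" * (LIMIT + 10)),
    (102, "\u00e9" * (LIMIT // 2 + 1)),  # 'é' takes 2 bytes in UTF-8
]
print(oversized(sample, LIMIT))  # [(101, 1048586), (102, 1048578)]
```

Offset 102 is worth noting: it has only about half a million characters, yet it is over 1 MB once encoded, which is easy to miss when checking message length in a text editor.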


(Zero) #6

After setting everything back to the defaults, where Kafka accepts messages of at most 1MB (with a 1MB limit on the producer as well), I still receive this error from time to time in Graylog:

2019-01-08T11:56:20.059+01:00 ERROR [KafkaTransport] Kafka consumer error, stopping consumer thread.
kafka.common.MessageSizeTooLargeException: Found a message larger than the maximum fetch size of this consumer on topic xxxx_topic partition 2 at fetch offset 1492647. Increase the fetch size, or decrease the maximum message size the broker will allow.
at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:90) ~[graylog.jar:?]
at kafka.consumer.ConsumerIterator.makeNext(ConsumerIterator.scala:33) ~[graylog.jar:?]
at kafka.utils.IteratorTemplate.maybeComputeNext(IteratorTemplate.scala:66) ~[graylog.jar:?]
at kafka.utils.IteratorTemplate.hasNext(IteratorTemplate.scala:58) ~[graylog.jar:?]
at org.graylog2.inputs.transports.KafkaTransport$6.run(KafkaTransport.java:228) [graylog.jar:?]
at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:176) [graylog.jar:?]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_191]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_191]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_191]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_191]
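The exception names the two settings that have to agree: the consumer’s fetch size and the largest message the broker will accept. In Kafka’s old high-level consumer the fetch size is `fetch.message.max.bytes`; on the broker the limit is `message.max.bytes`, which a topic-level `max.message.bytes` override replaces for that topic. The sketch below only compares numbers you supply; the values are examples, and the per-topic override is a hypothetical case worth ruling out, not something shown in the thread.

```python
ONE_MB = 1024 * 1024


def effective_message_limit(cfg):
    """Largest message the broker accepts for this topic: a topic-level
    max.message.bytes override wins over the broker-wide message.max.bytes."""
    return cfg.get("max.message.bytes", cfg["message.max.bytes"])


def fetch_size_problem(cfg):
    """Return a description of the mismatch, or None if the consumer can fetch any accepted message."""
    limit = effective_message_limit(cfg)
    fetch = cfg["fetch.message.max.bytes"]
    if fetch < limit:
        return f"fetch.message.max.bytes={fetch} is smaller than the topic's message limit {limit}"
    return None


# Example values only; substitute the real broker, topic, and consumer settings.
defaults_only = {"message.max.bytes": ONE_MB, "fetch.message.max.bytes": ONE_MB}
with_override = {**defaults_only, "max.message.bytes": 2 * ONE_MB}  # hypothetical topic override

print(fetch_size_problem(defaults_only))  # None
print(fetch_size_problem(with_override))
```

If the check passes with the real values and the error still appears, the oversized item is likely something other than a single plain message, and the specific offset in the error (partition 2, offset 1492647) is the place to look.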


(system) #7

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.