Messages being fed to Graylog and apparently indexed, but they're not visible anymore

Yes, Elasticsearch is running on the same server, and yes, Elasticsearch is continuously running. I typically don’t do anything with Elasticsearch between the ‘systemctl restart graylog-server’ calls, and haven’t in the cases described above.

did you find anything useful in the elasticsearch logs? like high/low watermark or similar?

“grep -i watermark /var/log/elasticsearch/*” gives no hits. Graylog jumps to almost 200% CPU use in “top” (on a 2-CPU system) and stays there when the problem hits, BTW.
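For reference, the watermark situation can also be checked against the cluster directly rather than via the logs. A quick sketch, assuming Elasticsearch answers unauthenticated on localhost:9200:

# Cluster health plus per-node disk usage as Elasticsearch sees it.
curl -s 'http://localhost:9200/_cluster/health?pretty'
curl -s 'http://localhost:9200/_cat/allocation?v'

A red cluster status or a disk.percent close to the watermark thresholds would point in that direction.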

Further information: the current Elasticsearch GC log, /var/log/gc.log, contains lines of the same type as it did on a date when everything was working fine (I picked 2019-09-03), i.e. it contains entries like the following:

2019-10-28T10:59:01.837+0100: 4,589: Total time for which application threads were stopped: 0,0002891 seconds, Stopping threads took: 0,0000282 seconds
2019-10-28T10:59:01.991+0100: 4,743: [GC (Allocation Failure) 2019-10-28T10:59:01.991+0100: 4,743: [ParNew
Desired survivor size 8716288 bytes, new threshold 2 (max 6)

- age   1: 8448704 bytes, 8448704 total
- age   2: 273728 bytes, 8722432 total
: 144392K->9108K(153344K), 0,0135769 secs] 148566K->13281K(1031552K), 0,0136451 secs] [Times: user=0,02 sys=0,00, real=0,01 secs]

2019-10-28T10:59:02.005+0100: 4,756: Total time for which application threads were stopped: 0,0137915 seconds, Stopping threads took: 0,0000414 seconds
2019-10-28T10:59:02.006+0100: 4,758: Total time for which application threads were stopped: 0,0000790 seconds, Stopping threads took: 0,0000247 seconds

However, it looks to me as if there are now much larger blocks of lines reading “Total time for which application threads were stopped: … seconds, Stopping threads took: … seconds” than in the logs from 3 Sep. I don’t know what these lines mean, though, so I can’t deduce anything from them, their number or their frequency.

The current Elasticsearch gc.log also contains lines like the following every now and then:

2019-10-28T11:02:09.954+0100: 192,706: [GC (CMS Initial Mark) [1 CMS-initial-mark: 662413K(878208K)] 697483K(1031552K), 0,0037693 secs] [Times: user=0,00 sys=0,00, real=0,00 secs]
2019-10-28T11:02:09.958+0100: 192,710: [CMS-concurrent-mark-start]
2019-10-28T11:02:10.834+0100: 193,585: [CMS-concurrent-mark: 0,876/0,876 secs] [Times: user=0,45 sys=0,01, real=0,87 secs]
2019-10-28T11:02:10.834+0100: 193,585: [CMS-concurrent-preclean-start]
2019-10-28T11:02:10.853+0100: 193,604: [CMS-concurrent-preclean: 0,019/0,019 secs] [Times: user=0,01 sys=0,00, real=0,02 secs]
2019-10-28T11:02:10.853+0100: 193,604: [CMS-concurrent-abortable-preclean-start]
2019-10-28T11:02:11.106+0100: 193,857: [GC (Allocation Failure)
2019-10-28T11:02:11.106+0100: 193,857: [ParNew2019-10-28T11:02:11.114+0100: 193,865: [CMS-concurrent-abortable-preclean: 0,060/0,261 secs] [Times: user=0,11 sys=0,00, real=0,26 secs]
2019-10-28T11:02:11.168+0100: 193,920: [GC (CMS Final Remark) [YG occupancy: 30428 K (153344 K)]
2019-10-28T11:02:11.168+0100: 193,920: [Rescan (parallel) , 0,0187953 secs]
2019-10-28T11:02:11.187+0100: 193,938: [weak refs processing, 0,0003606 secs]
2019-10-28T11:02:11.188+0100: 193,939: [class unloading, 0,0527873 secs]
2019-10-28T11:02:11.240+0100: 193,992: [scrub symbol table, 0,0274740 secs]
2019-10-28T11:02:11.268+0100: 194,019: [scrub string table, 0,0183974 secs][1 CMS-remark: 674418K(878208K)] 704847K(1031552K), 0,1181006 secs] [Times: user=0,03 sys=0,00, real=0,12 secs]

Whatever that may be…
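To at least compare the current gc.log with the one from 3 Sep quantitatively, something like the following should count those “stopped” lines and add up the reported times. A sketch only; it assumes GNU grep and awk, the comma decimal separator visible above, and that gc.log sits at the path mentioned earlier:

# Count the "application threads were stopped" lines and sum the reported pause times.
grep -hoP 'threads were stopped: \K[0-9,]+' /var/log/gc.log \
  | tr ',' '.' \
  | awk '{ sum += $1; n++ } END { printf "%d pauses, %.3f s total\n", n, sum }'

Running it against both the old and the current log gives one number per date to compare.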

what have you configured in your /etc/elasticsearch/elasticsearch.yml as logging location? maybe that is different from the default.

Cause it is not very likely that no logs are written …

Excerpt from /etc/elasticsearch/elasticsearch.yml:

# Path to log files:
#
path.logs: /var/log/elasticsearch

Umm, it’s not as if no logs are being written. They just don’t look extraordinary, or any different from when everything was working fine, at least as far as I can tell.

you must have something that is different. I have no other idea where to look and what to compare.

You need to Sherlock the reason now on your own … I thought it was something with Elasticsearch, like you manipulated the indices, or an index is read-only and the messages are only queued but not saved. But since nothing can be found in the logs, I can’t imagine what the problem is …
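If you want to rule the read-only case in or out quickly, a sketch (assuming Elasticsearch answers unauthenticated on localhost:9200):

# Look for a read-only block on any index or in the cluster-wide settings.
curl -s 'http://localhost:9200/_settings?pretty' | grep -i read_only
curl -s 'http://localhost:9200/_cluster/settings?pretty' | grep -i read_only

If neither command prints anything, no read-only block is set.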

I am steadily progressing, though without having had a breakthrough yet. Analysing the Graylog server.log with DEBUG-level logging turned on, I see that the following appears to be the normal sequence of events when messages are processed. I’ve removed the timestamps for better readability; all of these events take place within two seconds:

  • Phase 1: messages are written to the journal, read back from it, and passed to the process queue:

DEBUG [JournallingMessageHandler] End of batch, journalling 2 messages
DEBUG [KafkaJournal] Trying to write ByteBufferMessageSet with size of 1576 bytes to journal
DEBUG [KafkaJournal] Wrote 2 messages to journal: 1576 bytes (payload 1492 bytes), log position 6978716 to 6978717
DEBUG [JournallingMessageHandler] Processed batch, wrote 1492 bytes, last journal offset: 6978717, signalling reader.
DEBUG [JournalReader] Messages have been written to Journal, continuing to read.
DEBUG [KafkaJournal] Requesting to read a maximum of 65536 messages (or 5MB) from the journal, offset interval [6978716, 7044252)
DEBUG [KafkaJournal] Read 2 messages, total payload size 1492, from journal, offset interval [6978716, 6978717], requested read at 6978716
DEBUG [JournalReader] Processing 2 messages from journal.
DEBUG [KafkaJournal] Requesting to read a maximum of 65534 messages (or 5MB) from the journal, offset interval [6978718, 7044252)
DEBUG [KafkaJournal] No messages available to read for offset interval [6978718, 7044252).
DEBUG [JournalReader] No messages to read from Journal, waiting until the writer adds more messages.

  • Phase 2: The message processor kicks in – two messages have been queued above:

DEBUG [ProcessBufferProcessor] Starting to process message .
DEBUG [ProcessBufferProcessor] Starting to process message .
DEBUG [MessageFilterChainProcessor] Applying filter [Extractor] on message .
DEBUG [MessageFilterChainProcessor] Applying filter [Extractor] on message .
DEBUG [MessageFilterChainProcessor] Applying filter [Static field appender] on message .
DEBUG [MessageFilterChainProcessor] Applying filter [Static field appender] on message .
DEBUG [MessageFilterChainProcessor] Applying filter [StreamMatcher] on message .
DEBUG [MessageFilterChainProcessor] Applying filter [StreamMatcher] on message .
DEBUG [StreamMatcherFilter] Routed message to 0 streams.
DEBUG [StreamMatcherFilter] Routed message to 0 streams.

  • Phase 3: Now that processing has completed, consisting of the MessageFilterChainProcessor applying three filters (Extractor, Static field appender and StreamMatcher) plus the StreamMatcherFilter, both messages are written to the output buffer:

DEBUG [ProcessBufferProcessor] Finished processing message . Writing to output buffer.
DEBUG [ProcessBufferProcessor] Finished processing message . Writing to output buffer.
DEBUG [OutputBufferProcessor] Processing message from OutputBuffer.
DEBUG [OutputBufferProcessor] Processing message from OutputBuffer.
DEBUG [OutputBufferProcessor] Writing message to [class org.graylog2.outputs.BlockingBatchedESOutput].
DEBUG [OutputBufferProcessor] Writing message to [class org.graylog2.outputs.BlockingBatchedESOutput].
DEBUG [OutputBufferProcessor] Wrote message to all outputs. Finished handling.
DEBUG [OutputBufferProcessor] Wrote message to all outputs. Finished handling.

The use of org.graylog2.outputs.BlockingBatchedESOutput as the output class looks to me like Elasticsearch (ES) is indeed being written to.
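To double-check that documents really end up in Elasticsearch on the healthy node, the document count of the Graylog indices should keep growing. A sketch, assuming the default index prefix graylog and Elasticsearch on localhost:9200:

# Total document count across the Graylog indices; run it twice a minute apart and compare.
curl -s 'http://localhost:9200/graylog_*/_count?pretty'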

Compare this behaviour with that of my faulty node, which only ever gets to phase 1:

DEBUG [JournallingMessageHandler] End of batch, journalling 1 messages
DEBUG [KafkaJournal] Trying to write ByteBufferMessageSet with size of 785 bytes to journal
DEBUG [KafkaJournal] Wrote 1 messages to journal: 785 bytes (payload 743 bytes), log position 197984570 to 197984570
DEBUG [JournallingMessageHandler] Processed batch, wrote 743 bytes, last journal offset: 197984570, signalling reader.
DEBUG [JournalReader] Messages have been written to Journal, continuing to read.
DEBUG [KafkaJournal] Requesting to read a maximum of 895 messages (or 5MB) from the journal, offset interval [197984570, 197985465)
DEBUG [KafkaJournal] Read 1 messages, total payload size 743, from journal, offset interval [197984570, 197984570], requested read at 197984570
DEBUG [JournalReader] Processing 1 messages from journal.
DEBUG [KafkaJournal] Requesting to read a maximum of 894 messages (or 5MB) from the journal, offset interval [197984571, 197985465)
DEBUG [KafkaJournal] No messages available to read for offset interval [197984571, 197985465).
DEBUG [JournalReader] No messages to read from Journal, waiting until the writer adds more messages.

The ProcessBufferProcessor is never invoked: messages keep queueing, but nothing is being processed anymore. And processing never resumes…

I’ll try to capture the transition in a DEBUG or perhaps even TRACE level log next.

does the journal on the faulty node show a negative number for the journal size on the System > Nodes > node details page?

It might be that the journal is corrupt and the data is written into nirvana.

Sorry, no: the journal size is 5 GB, and messages are being written to the journal and read back from it whenever the node feels like processing messages.
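For completeness, the on-disk journal can also be inspected directly. A sketch, assuming the message_journal_dir default of the Linux packages; adjust the path to whatever server.conf says:

# Size and contents of the Graylog disk journal.
du -sh /var/lib/graylog-server/journal
ls -lh /var/lib/graylog-server/journal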

I have managed to get Graylog to survive, albeit with a rather radical measure: I have deleted all of the extractors. Before that, after collecting 1.5 GB of Graylog server.log data (TRACE set for all but the authentication subsystems), I found that practically all messages go through:

  • [ProcessBufferProcessor] Starting to process message <UUID>.
  • [MessageFilterChainProcessor] Applying filter [Extractor] on message <UUID>.
  • [MessageFilterChainProcessor] Applying filter [Static field appender] on message <UUID>.
  • [MessageFilterChainProcessor] Applying filter [StreamMatcher] on message <UUID>.
  • [ProcessBufferProcessor] Finished processing message <UUID>. Writing to output buffer.

However, for the last message I saw in the log before Graylog’s output dropped to zero, only the first two lines were present, i.e.:

  • [ProcessBufferProcessor] Starting to process message <UUID>.
  • [MessageFilterChainProcessor] Applying filter [Extractor] on message <UUID>.

I have a hunch that Graylog is not returning from one of the extractors for this message, and that this happens only in very few cases: the log excerpt I analysed contains 68412 messages, of which only ten show the behaviour described above, i.e. processing stops at or after “Applying filter [Extractor]”. Meanwhile, on the System -> Nodes -> Details screen, the node’s process buffer keeps filling with far more than just ten messages; it eventually reaches 65535, i.e. 100%… I have no explanation for that yet.

I will now be reactivating the extractors one by one, to see which is the problematic one. The complete lack of any errors in server.log won’t speed up the process, sadly…
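A thread dump taken on the faulty node while the process buffer is full might shortcut the search, since a processing thread stuck inside an extractor should show up with the corresponding code in its stack trace. A sketch, assuming a JDK with jstack is available and that the processing threads carry “processbuffer” in their names (the exact naming may differ between Graylog versions):

# Dump all threads of the graylog-server JVM and look at the processing threads.
# jstack may have to be run as the user owning the JVM, e.g. via sudo -u graylog.
PID=$(pgrep -f graylog-server | head -n 1)
jstack "$PID" > /tmp/graylog-threads.txt
grep -iA 20 processbuffer /tmp/graylog-threads.txt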
