Messages being fed to Graylog and apparently indexed, but they're not visible anymore

Hi

In our Graylog test setup, things were working fine and we were making good progress until, all of a sudden, messages no longer appeared to be indexed at all. I have so far been unable to find the reason for this, nor have I been able to fix it.

I have two inputs, one Beats input listening on 127.0.0.1:5044 and one syslog/TCP input on 127.0.0.1:20514. A local filebeat reads one log file in /var/log and sends its messages to the Beats input. rsyslog on this host receives logs, writes some of them into the file read by filebeat, and forwards some of them to Graylog’s syslog/TCP input (as well as writing them to a file). By looking at the log files, checking the filebeat logs and using tcpdump on TCP ports 20514 and 5044 on 127.0.0.1, I can tell that messages are definitely being sent to (and received by) Graylog. The throughput/metrics of the two inputs suggest the same, showing “1 minute average rates” of 2 msg/s (filebeat) and 13 msg/s (syslog/TCP) respectively. But when I click on “Show received messages”, I see nothing more recent than 2019-10-08 22:03:08.379 for the filebeat input and 2019-10-08 22:03:19.000 for the syslog/TCP input.
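
To rule out the transport path, one quick check is to push a single test line into the syslog/TCP input by hand - a minimal Python sketch, assuming the 127.0.0.1:20514 input described above; the hostname and tag in the test message are made up:

import socket
import time

HOST, PORT = "127.0.0.1", 20514          # the syslog/TCP input described above

# Minimal RFC 3164-style line: <PRI>timestamp host tag: text
line = "<13>%s testhost graylog-test: hello from the test script\n" % time.strftime("%b %d %H:%M:%S")

with socket.create_connection((HOST, PORT), timeout=5) as sock:
    sock.sendall(line.encode("utf-8"))
print("sent:", line.strip())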

Strangely enough, if I look at the indices, the current default index set says: “graylog_10 (active write index): Contains messages up to a few seconds ago (1.2GiB / 2,119,377 messages)”. So the index is apparently being updated.
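
One way to confirm that the active write index really keeps growing is to poll its document count directly in Elasticsearch - a minimal sketch, assuming Elasticsearch listens on localhost:9200 (that address is an assumption, adjust as needed) and using the graylog_10 index mentioned above:

import json
import time
import urllib.request

ES, INDEX = "http://localhost:9200", "graylog_10"   # assumed ES address; index name from above

def doc_count():
    with urllib.request.urlopen("%s/%s/_count" % (ES, INDEX), timeout=10) as resp:
        return json.load(resp)["count"]

previous = doc_count()
for _ in range(5):                                   # five samples, 30 s apart
    time.sleep(30)
    current = doc_count()
    print("count: %d (+%d in the last 30 s)" % (current, current - previous))
    previous = current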

I’m grateful for any hints on how to fix my system. The few hairs I haven’t pulled out of my head yet will be, too.

Cheers,
Tobias

Two possible problems and solutions:

  • Calculate the index range (System > Indices > index name > Maintenance > Recalculate index ranges); see the sketch after this list.
  • Look at a broader timeframe: use an absolute timerange and search from one year in the past to one year in the future. Maybe the timestamp of the messages is not “now” or something similar.
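
The first bullet can also be triggered through the REST API instead of the UI - a minimal Python sketch, assuming the API at http://127.0.0.1:9000/api and placeholder admin credentials; I believe POST /system/indices/ranges/rebuild is the call behind the UI button, but verify it against your node’s API browser:

import base64
import urllib.request

API = "http://127.0.0.1:9000/api"                              # assumed API address
AUTH = "Basic " + base64.b64encode(b"admin:secret").decode()   # placeholder credentials

req = urllib.request.Request(
    API + "/system/indices/ranges/rebuild",          # rebuild ranges for all indices
    method="POST",
    headers={"Authorization": AUTH, "X-Requested-By": "range-rebuild-script"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    print("HTTP", resp.status)                       # a 2xx status means the job was accepted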

Hello @jan
I have the same issue, suddenly messages have stopped indexing

Could you advise what the issue might be? I don’t have any errors being displayed.

check if the cluster is writable - check the logfiles of Elasticsearch and Graylog.
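
One way to do the writability check directly against Elasticsearch - a sketch assuming it listens on localhost:9200 without authentication:

import json
import urllib.request

ES = "http://localhost:9200"                         # assumed ES address

def get(path):
    with urllib.request.urlopen(ES + path, timeout=10) as resp:
        return json.load(resp)

print("cluster status:", get("/_cluster/health")["status"])   # red means trouble

# A full disk makes Elasticsearch set this block and silently reject writes.
for index, body in sorted(get("/_all/_settings?flat_settings=true").items()):
    block = body["settings"].get("index.blocks.read_only_allow_delete")
    if block:
        print(index, "index.blocks.read_only_allow_delete =", block)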

Sorry for the delayed response, I was out of town for half a week. Unfortunately, index range recalculation didn’t help, and the timestamps don’t appear to be the problem, either. I’ll check if the cluster is writable – I hope I recognise that from the log files.

I did, BTW, restart the machine Graylog is running on, on the day I reported the problem, and that resulted in around 81,000 messages being indexed and subsequently appearing. But that was it, and no newly arriving messages can be seen.

The logs don’t tell me a lot, I’m afraid. The current Graylog server.log, which begins 2019-10-21T06:40:46.667+02:00, contains only lines reading

WARN [LicenseChecker] License violation - Detected irregular traffic records

On 2019-10-18, when I reported the issue, it still had numerous lines like the following two:

2019-10-18T13:37:41.534+02:00 WARN [Messages] Failed to index message: index=<palo_alto_8> id= error=<{"type":"mapper_parsing_exception","reason":"failed to parse field [pa_dev_group_hierarchy_level_1] of type [integer] in document with id 'b14fe960-f19b-11e9-9f74-005056842309'","caused_by":{"type":"number_format_exception","reason":"For input string: \"0x0\""}}>
2019-10-18T13:37:41.534+02:00 ERROR [Messages] Failed to index [1] messages. Please check the index error log in your web interface for the reason. Error: One or more of the items in the Bulk request failed, check BulkResult.getItems() for more information.

I’d asked about these parsing errors here before. Since I wasn’t able to correct the mapping problem, I decided to disable the extractors that produce the field pa_dev_group_hierarchy_level_1. I had done that well before 2019-10-18, but since errors relating to this field were still popping up, I decided to be a bit more drastic and rebooted the system. That’s when I got the short burst of messages that made it into Graylog. And the pa_dev_group_hierarchy_level_1 mapping errors have since disappeared.
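
To see whether anything is still being rejected after that change, the recorded indexing failures can also be listed via the REST API - a sketch assuming the standard /system/indexer/failures endpoint, the API at http://127.0.0.1:9000/api, and placeholder credentials:

import base64
import json
import urllib.request

API = "http://127.0.0.1:9000/api"                              # assumed API address
AUTH = "Basic " + base64.b64encode(b"admin:secret").decode()   # placeholder credentials

req = urllib.request.Request(
    API + "/system/indexer/failures?limit=10&offset=0",
    headers={"Authorization": AUTH, "Accept": "application/json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    data = json.load(resp)

print("total failures recorded:", data.get("total"))
for failure in data.get("failures", []):
    print(failure.get("timestamp"), failure.get("index"), failure.get("message"))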

The other thing I can see in the Graylog server.log is that it rotated indices starting on 2019-10-21T02:00:06.368+02:00, but that looks fine.

As for the Elasticsearch logs, I don’t really understand them… In /var/log/elasticsearch, I’ve got a number of files of the form gc.log.N[.current], with gc.log.0.current being the one with today’s timestamp. Its last few lines read as follows:

2019-10-23T10:12:26.384+0200: 420429,081: Total time for which application threads were stopped: 0,0140923 seconds, Stopping threads took: 0,0000427 seconds
2019-10-23T10:17:06.375+0200: 420709,072: [GC (Allocation Failure) 2019-10-23T10:17:06.375+0200: 420709,072: [ParNew
Desired survivor size 8716288 bytes, new threshold 6 (max 6)

- age   1:     335880 bytes,     335880 total
- age   2:        408 bytes,     336288 total
- age   3:         32 bytes,     336320 total
: 136817K->432K(153344K), 0,0120600 secs] 526626K->390241K(1031552K), 0,0121570 secs] [Times: user=0,01 sys=0,00, real=0,01 secs]

I have no idea what that means. The other current log files in that directory are:

  • graylog_deprecation.log
  • graylog_index_indexing_slowlog.log (zero bytes)
  • graylog_index_search_slowlog.log (zero bytes)
  • graylog.log

The entire contents of graylog.log are:
[2019-10-21T02:00:07,215][INFO ][o.e.c.m.MetaDataIndexTemplateService] [Tt8LRUG] adding template [firewall-1-template] for index patterns [firewall-1_*]
[2019-10-21T02:00:07,907][INFO ][o.e.c.m.MetaDataCreateIndexService] [Tt8LRUG] [firewall-1_25] creating index, cause [api], templates [firewall-1-template], shards [4]/[0], mappings [message]
[2019-10-21T02:00:08,975][INFO ][o.e.c.r.a.AllocationService] [Tt8LRUG] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[firewall-1_25][2], [firewall-1_25][3], [firewall-1_25][1]] …]).
[2019-10-21T02:00:10,199][INFO ][o.e.c.m.MetaDataIndexTemplateService] [Tt8LRUG] adding template [mswinsecaudit-template] for index patterns [mswinsecaudit_*]
[2019-10-21T02:00:10,206][INFO ][o.e.c.m.MetaDataCreateIndexService] [Tt8LRUG] [mswinsecaudit_13] creating index, cause [api], templates [mswinsecaudit-template], shards [4]/[0], mappings [message]
[2019-10-21T02:00:10,348][INFO ][o.e.c.r.a.AllocationService] [Tt8LRUG] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[mswinsecaudit_13][2], [mswinsecaudit_13][1], [mswinsecaudit_13][0]] …]).
[2019-10-21T02:00:10,434][INFO ][o.e.c.m.MetaDataIndexTemplateService] [Tt8LRUG] adding template [palo_alto-template] for index patterns [palo_alto_*]
[2019-10-21T02:00:10,442][INFO ][o.e.c.m.MetaDataCreateIndexService] [Tt8LRUG] [palo_alto_9] creating index, cause [api], templates [paloalto-custom-mapping, palo_alto-template], shards [4]/[0], mappings [message]
[2019-10-21T02:00:10,609][INFO ][o.e.c.r.a.AllocationService] [Tt8LRUG] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[palo_alto_9][1], [palo_alto_9][0]] …]).

I don’t see anything alarming there, either. Any further hints?

Cheers,
Tobias

So you have index failures because of a wrong mapping - all messages that do not fit the mapping created from the first message ingested into the Elasticsearch index will be dropped. The game restarts on index rotation.

The only way to bypass this is to create a custom mapping that forces a specific type for a specific field.
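
A minimal sketch of such a custom mapping, here forcing pa_dev_group_hierarchy_level_1 to keyword so that values like “0x0” no longer break integer parsing - it assumes Elasticsearch 6.x on localhost:9200 and reuses the paloalto-custom-mapping template name that already appears in the logs above; it only takes effect on the next index rotation:

import json
import urllib.request

ES = "http://localhost:9200"                         # assumed ES address

template = {
    "index_patterns": ["palo_alto_*"],               # the Palo Alto index set from the logs
    "order": 10,                                     # higher than Graylog's own template
    "mappings": {
        "message": {                                 # the mapping type Graylog uses
            "properties": {
                "pa_dev_group_hierarchy_level_1": {"type": "keyword"}
            }
        }
    },
}

req = urllib.request.Request(
    ES + "/_template/paloalto-custom-mapping",       # template name seen in the logs above
    data=json.dumps(template).encode("utf-8"),
    method="PUT",
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req, timeout=30) as resp:
    print(json.load(resp))                           # {"acknowledged": true} on success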

I have disabled the extractor that led to the mapping errors. Notice that the last mapping error is from 18 Oct, which is five days ago.

BTW, the mapping errors had been there for a considerable period of time and didn’t prevent messages from being indexed and showing up. Maybe some messages were lost, but definitely only a tiny fraction. Now I can’t see anything.

Here you can see that the inputs are currently receiving messages:

But the search window is empty except for messages that came in within a certain time period leading up to when I last rebooted the server – note that I’ve blurred the message source and messages for the sake of my client’s privacy:

That you have 0 “out” messages means your Elasticsearch is not accepting messages.

You should check your Graylog server.log …

In addition, maybe check the health of the Elasticsearch server/cluster … I guess your Elasticsearch is read-only …

Thanks, Jan, but I have checked my Graylog server.log, see above. I see no indications of problems in it, nor in the Elasticsearch logs, but I’m pretty much out of my depth with the latter, so I may well be missing something there.

I welcome your hypothesis that my Elasticsearch is read-only. But how can I check that?

the Graylog server.log (/var/log/graylog-server/server.log) should indicate that …

Check System > Overview to see if you can spot errors there.
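
A more direct test of the read-only theory is to try writing a throw-away document straight into Elasticsearch - a sketch assuming Elasticsearch on localhost:9200; the write-test index name is made up and can be deleted afterwards:

import json
import urllib.error
import urllib.request

ES = "http://localhost:9200"                         # assumed ES address

req = urllib.request.Request(
    ES + "/write-test/doc/1",                        # throw-away index and document
    data=json.dumps({"probe": "can I write?"}).encode("utf-8"),
    method="PUT",
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("write accepted:", json.load(resp).get("result"))
except urllib.error.HTTPError as err:
    # A 403 with cluster_block_exception typically means the indices are read-only,
    # often because the disk hit the flood-stage watermark.
    print("write rejected:", err.code, err.read().decode("utf-8", "replace"))

If the write is rejected with a cluster_block_exception, freeing disk space and clearing the read_only_allow_delete block is usually the fix; delete the write-test index afterwards.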

System/Overview says:

  • No notifications
  • System jobs: No active system jobs
  • Graylog cluster: displays cluster ID and outgoing traffic, which looks pretty dismal (see below)
  • Elasticsearch cluster: No failed indexing attempts in the last 24 hours.
  • Time configuration: looks good, everything in sync and identical
  • System messages: Nothing indicating an error or a problem

Here’s the Graylog cluster portion of System/Overview:

Graylog’s server.log file (/var/log/graylog-server/server.log) contains only lines like the following:
2019-10-25T08:00:46.666+02:00 WARN [LicenseChecker] License violation - Detected irregular traffic records
This line appears every five minutes. The only other line is one instance of ‘INFO [SessionsResource] Invalid username or password for user “admin”’ from when I mistyped the password.

Incidentally, in an attempt to gather further information, two days ago I configured rsyslog on the Graylog system I’ve been diagnosing to forward some logs to a downstream Graylog that I have played with much less so far. The behaviour is astonishingly (and disappointingly) similar: Graylog indexed 43,686 messages from 2019-10-23 15:34:42.000, when I enabled the input, until 2019-10-23 16:00:21.000, when it suddenly stopped. No user interaction whatsoever. Logs are still flowing, as can be seen in the following tcpdump snippet:

2019-10-25 08:05:49.107786 IP a.b.c.2.39008 > a.b.c.1.10513: tcp 506
2019-10-25 08:05:49.107794 IP a.b.c.1.10513 > a.b.c.2.39008: tcp 0
2019-10-25 08:05:49.151487 IP a.b.c.2.39008 > a.b.c.1.10513: tcp 505
2019-10-25 08:05:49.151494 IP a.b.c.1.10513 > a.b.c.2.39008: tcp 0
2019-10-25 08:05:49.278365 IP a.b.c.2.39008 > a.b.c.1.10513: tcp 1012
2019-10-25 08:05:49.278371 IP a.b.c.1.10513 > a.b.c.2.39008: tcp 0
2019-10-25 08:05:49.287144 IP a.b.c.2.39008 > a.b.c.1.10513: tcp 505
2019-10-25 08:05:49.287149 IP a.b.c.1.10513 > a.b.c.2.39008: tcp 0

The input in question is still reporting traffic in its metrics:
Throughput / Metrics
1 minute average rate: 15 msg/s
Network IO: 5.9KiB 0B (total: 1.0GiB 0B )
Active connections: 2 (2 total)
Empty messages discarded: 0

I am beginning to suspect a bug in Graylog… This is Graylog 3.1.2 on both systems, on Debian GNU/Linux 9.

I have a new observation. I restarted Graylog on both systems and now the downstream system appears to be working as it should, while the upstream one is still exhibiting the problem. I have also noticed:

  1. The in/out status at the top right of the Graylog window (left of the help icon, i.e. “?”, and the logged-in user) displays incoming messages on the problem system (i.e. upstream), but “out” stays at zero. On the other system (downstream), in and out always show identical numbers.

  2. System/Nodes shows a ton of unprocessed messages for the problem system, and their number keeps growing, while it stays low on the other one. More specifically, right now, the readings are:
    upstream (problem): “The journal contains 570,144 unprocessed messages in 13 segments. 72 messages appended, 0 messages read in the last second.”
    downstream (working): “The journal contains 60 unprocessed messages in 1 segment. 43 messages appended, 43 messages read in the last second.”

Does this observation help in any way?
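
For reference, the journal figures above can also be polled from a script instead of refreshing System/Nodes - a minimal sketch against the Graylog REST API, assuming it is at http://127.0.0.1:9000/api with placeholder credentials; the exact field names in the response may differ between versions:

import base64
import json
import time
import urllib.request

API = "http://127.0.0.1:9000/api"                              # assumed API address
AUTH = "Basic " + base64.b64encode(b"admin:secret").decode()   # placeholder credentials

def journal():
    req = urllib.request.Request(
        API + "/system/journal",
        headers={"Authorization": AUTH, "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

for _ in range(10):                                  # ten samples, 30 s apart
    info = journal()
    print(time.strftime("%H:%M:%S"),
          "uncommitted:", info.get("uncommitted_journal_entries"),
          "append/s:", info.get("append_events_per_second"),
          "read/s:", info.get("read_events_per_second"))
    time.sleep(30)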

I have now noticed that the problem system will successfully process queued, as-yet-unprocessed messages when I restart Graylog. However, after a while it stops processing entirely and the backlog of unprocessed messages quickly grows. I am now trying to determine how long it takes until processing stops and have restarted the Graylog server again…

Could there be a throttling issue? I’m experimenting with activating DEBUG-level logging for the problem node. And in an effort to clear the backlog, I am restarting the Graylog server and disabling all inputs, which speeds up its processing by several orders of magnitude. It manages to process around 70,000 messages before it stops. I’m at 463,959 messages right now; let’s see how many restarts I’ll need to get it to zero.
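
For the DEBUG experiment, the log level can apparently be raised at runtime through the REST API instead of editing log4j2.xml - a sketch assuming the /system/loggers endpoint; the logger name and credentials below are placeholders, and the change does not survive a restart:

import base64
import urllib.request

API = "http://127.0.0.1:9000/api"                              # assumed API address
AUTH = "Basic " + base64.b64encode(b"admin:secret").decode()   # placeholder credentials
LOGGER = "org.graylog2.shared.buffers"                         # example logger name, adjust as needed

req = urllib.request.Request(
    "%s/system/loggers/%s/level/DEBUG" % (API, LOGGER),
    method="PUT",
    headers={"Authorization": AUTH, "X-Requested-By": "debug-script"},
)
with urllib.request.urlopen(req, timeout=10) as resp:
    print("HTTP", resp.status)                       # 2xx means the level was changed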

OK, the queue of unprocessed messages has been fully drained and I’ve turned on the inputs again. So far it’s looking good. Let’s see if it holds…

It held for around half an hour. A few minutes ago, “out” dropped to zero and has stayed there since. Messages are accumulating again; the system has just passed 6,000 unprocessed messages. What gives?!

Is Elasticsearch running on the same server?
Is Elasticsearch running when your ingest (out) drops to 0 and Graylog starts to queue messages?
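
One way to answer the second question is to log the Elasticsearch cluster status once a minute and correlate it with the moment “out” drops to zero - a minimal sketch, assuming Elasticsearch on localhost:9200:

import json
import time
import urllib.request

ES = "http://localhost:9200"                         # assumed ES address; stop with Ctrl-C

while True:
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    try:
        with urllib.request.urlopen(ES + "/_cluster/health", timeout=10) as resp:
            health = json.load(resp)
        print(stamp, health["status"],
              "active_shards:", health["active_shards"],
              "pending_tasks:", health["number_of_pending_tasks"])
    except OSError as err:                            # covers refused connections, timeouts, HTTP errors
        print(stamp, "elasticsearch not reachable:", err)
    time.sleep(60)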