TimeZone calculations totally messed up?

Hi!

Something seems to be fundamentally broken with Graylog time.

All my servers are NTP-synchronized to UTC. I have checked that all servers in the cluster have their clocks synchronized.

Despite this, somehow messages are being captured about 15 minutes in the future and presented 2 hours in the past in my interface.

Yes, the interface is empty for the first 2 hours, then messages are displayed at the original UTC time plus some 15 minutes (no idea where those minutes come from), despite the fact that I am in UTC+2.

Most software that deals with time (including Elasticsearch and Java) stores time internally with a universal, epoch-like data type. Times are normalized before storage and then presented according to the browser's timezone, or however the user wants.

It is very difficult to understand what is going on inside Graylog, but I am assuming that it handles time like Elasticsearch does, and from that assumption what I am seeing seems just absurd.

I also suspect that something is broken, because this was not happening before.

What could be going on?? Any help is appreciated.

For example:

My computer clock says 10:35 now (UTC+2); on my servers it is 08:35 UTC.

I see my last message at 08:42 +00 in the search (it should be shown at 10:42 in my timezone, but it is not); it comes from the server Dev2. The histogram in the search screen is empty from 08:42 to 10:35.

I wondered if the clock in Dev2 was running fast, but when I log in and check, the time is 08:35 UTC. Perfectly in sync.

So, there are 2 things here:

  • How come the search is not presenting the results in my timezone? I am sure this worked before.
  • Where in the chain are logs being advanced some minutes into the future?

For some reason, the problem fixed itself.

Everything is back to normal this morning, without my intervention.

This is very strange.

And now it is doing it again… here is how it looks, though now it is only a 1 h delta.

This has to be some kind of bug.

How Graylog saves timestamps depends on the log format. Graylog tries to use the date/time from the log message itself. If you use syslog but the message does not contain a timezone in its date/time (syslog RFC 3164), Graylog stores it as UTC. If you use syslog RFC 5424 messages with a timezone, Graylog uses that timezone. Graylog always stores messages with UTC timestamps in the Elasticsearch database and renders them based on the user's timezone.
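
For illustration, here are two hypothetical syslog messages for the same event: the RFC 3164 one carries no timezone, so Graylog has to assume one, while the RFC 5424 one carries an explicit offset:

    <13>Nov  2 10:42:00 dev2 myapp: something happened                      (RFC 3164, no timezone)
    <13>1 2020-11-02T10:42:00+02:00 dev2 myapp - - - something happened     (RFC 5424, explicit +02:00)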

UTC for Graylog is based on the Graylog server's time. So first check the time and timezone settings on your Graylog Linux box: check the output of the command timedatectl for Local time, Time zone, and whether the clock is synchronized. Also, if you use the admin account to log into the Graylog interface, set the correct timezone with the root_timezone parameter in /etc/graylog/server/server.conf.
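
For example (the zone below is just a placeholder, use your own IANA timezone name):

    # /etc/graylog/server/server.conf
    root_timezone = Europe/Madrid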

If you log into the Graylog web interface with another user, set the appropriate timezone in that user's settings or profile.

thanks @shoothub

All my servers have the same time configuration thanks to Ansible automation; the logger says:

      Local time: Mon 2020-11-02 09:32:26 UTC
  Universal time: Mon 2020-11-02 09:32:26 UTC
        RTC time: Mon 2020-11-02 09:32:26
       Time zone: Etc/UTC (UTC, +0000)
 Network time on: yes
NTP synchronized: yes
 RTC in local TZ: no

Also, my Graylog deployment is a plain 3.3.8-1 Docker setup. I don’t think I have configured anything related to time.
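
For anyone reading along with a similar Docker setup: the official image maps server.conf settings to environment variables (GRAYLOG_ prefix plus the upper-cased setting name), so a hypothetical compose excerpt for the timezone setting suggested above would look roughly like this:

    # docker-compose.yml excerpt (illustrative only)
    environment:
      - GRAYLOG_ROOT_TIMEZONE=Europe/Madrid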

But now I am thinking that this is a performance issue, because the time delta changes over time: sometimes it is 8 hours, sometimes 1, sometimes 0.

The virtual machine where Graylog sits is not particularly performant, but my environment is not particularly big: about 15 VMs with Docker services inside; this is just a test/dev environment.

Looking at the Graylog VM, it does not seem particularly busy: about 30% CPU usage, a load average around 2, and 50% memory usage (4 GB RAM, 2 cores).

However, I notice that Graylog is constantly at 100% CPU (I guess it is using one core constantly).

I wonder if there is a bottleneck there somehow and Graylog cannot keep up. I still find it surprising; I never experienced this on version 2.
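
For reference (nothing I have changed myself; the values below are only the illustrative defaults from a stock server.conf), the amount of parallel message processing Graylog does is governed by settings like these:

    # /etc/graylog/server/server.conf (illustrative defaults)
    processbuffer_processors = 5
    outputbuffer_processors = 3
    inputbuffer_processors = 2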

Look @shoothub, I can see that I have:

**180,258 unprocessed messages** are currently in the journal, in 12 segments.

However, both Graylog and Elasticsearch are using around 50% of their heaps or less.

So nothing seems to be a bottleneck! That is very strange!
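
As a side note, assuming a standard 3.x setup, the same journal figures can be pulled from the REST API (the hostname and credentials below are placeholders):

    curl -u admin:password http://graylog.example.org:9000/api/system/journal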

I think what is happening here must be that the “processing” stage is being held up by asynchronous calls to external systems.

Meaning that Graylog is waiting (that’s why I don’t see any bottleneck) for something external to happen.

Perhaps the lookup tables, or something similar? I have a firewall and it is only possible to go out via a proxy. Maybe something like that.

Just in case anybody else runs into this, I think I found what the problem is:

I noticed this log line on one of my DNS servers:

[250499.508394] nf_conntrack: nf_conntrack: table full, dropping packet

Looking up information about this error, it is normally associated with denial-of-service attacks: basically the kernel’s connection-tracking table is full and it cannot keep up with new connections.
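
A quick way to confirm this on the affected box (assuming a reasonably modern Linux kernel with the netfilter conntrack module loaded) is to compare the current entry count with the limit, and raise the limit if needed:

    # current number of tracked connections vs. the configured limit
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max
    # raise the limit temporarily (the value is illustrative)
    sysctl -w net.netfilter.nf_conntrack_max=262144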

I tcpdumped inside that DNS VM and our Graylog friend is overwhelming it.

Now I remember that I activated reverse DNS resolution in my syslog input. But I would never have thought that an individual DNS query would be generated for every single log entry; I assumed some form of DNS caching would be used, but it seems that is not the case.
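
A possible mitigation (a sketch only; the upstream address is a placeholder) would be to put a small caching resolver such as dnsmasq in front of the Graylog host, so that repeated PTR lookups for the same addresses are answered locally instead of hitting the DNS VM every time:

    # /etc/dnsmasq.conf (illustrative)
    listen-address=127.0.0.1
    cache-size=10000
    server=10.0.0.53        # upstream DNS server (placeholder)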

The processing queue is now 0, and I will watch the system to see if that was the root cause…

Still, my IPs are mostly dynamic due to orchestration, and I would like to have some kind of reverse lookup that is better suited to this use case.

Do you know any way to get that, @shoothub?
