Index rotation needs manual index recalculation every day

Description of your problem

Index rotation happens at midnight UTC every day, and afterwards the index ranges show “messages from 51 years ago”.

Logs are also not searchable in the web interface. Our workaround so far is to run a manual index range recalculation on each index set.
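
For completeness, the recalculation we run in the UI can also be triggered from the Graylog REST API. This is only a sketch: the host, credentials, and index name are placeholders, and the exact API paths may vary between Graylog versions.

```shell
# Sketch: trigger index-range recalculation via the Graylog REST API.
# Host, credentials, and index name below are placeholders.
GRAYLOG="http://graylog.MYDOMAIN.com:9000"

# Rebuild the ranges of every index in every index set:
curl -u admin:password -X POST "$GRAYLOG/api/system/indices/ranges/rebuild"

# Rebuild the range of a single index:
curl -u admin:password -X POST "$GRAYLOG/api/system/indices/ranges/gcm_usqa_560/rebuild"
```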

Description of steps you’ve taken to attempt to solve the issue

  • Restarted graylog-server on all nodes to ensure it started after MongoDB
  • Confirmed that all connecting clients are time-synced - although this is our dev logging environment, so this is difficult to manage on an ongoing basis.
  • Fixed all timestamp index errors and all indexing errors. We do currently have some timestamp WARNings; however, these issues come and go, and we are working towards getting rid of them.

Environmental information

Information about our AWS setup:

  • graylog-server (version 3.3.12) cluster - running on Ubuntu Xenial hosts
    • 1 graylog-server master
    • 2 graylog-server slaves - accept logs via a Network Load Balancer
  • MongoDB cluster (version 4.0) - 3 nodes - running on Ubuntu Xenial hosts
  • Elasticsearch cluster (version 6.8.8) - running on Ubuntu Xenial hosts: 3 master nodes, 4 data nodes.

Snippet of Index rotation output

2021-11-19T00:00:45.218Z INFO  [SystemJobManager] SystemJob <ac5cda10-48cb-11ec-ba22-0a491ee5d31d> [org.graylog2.indexer.indices.jobs.SetIndexReadOnlyAndCalculateRangeJob] finished in 320ms.
2021-11-19T00:00:45.520Z INFO  [SetIndexReadOnlyJob] Flushing old index <gcm_usqa_560>.
2021-11-19T00:00:45.680Z INFO  [SetIndexReadOnlyJob] Setting old index <gcm_usqa_560> to read-only.
2021-11-19T00:00:45.768Z INFO  [SystemJobManager] Submitted SystemJob <bec36480-48cb-11ec-ba22-0a491ee5d31d> [org.graylog2.indexer.indices.jobs.OptimizeIndexJob]
2021-11-19T00:00:45.774Z INFO  [OptimizeIndexJob] Optimizing index <gcm_usqa_560>.
2021-11-19T00:00:45.783Z INFO  [CreateNewSingleIndexRangeJob] Calculating ranges for index gcm_usqa_560.
2021-11-19T00:00:45.787Z INFO  [MongoIndexRangeService] Calculated range of [gcm_usqa_560] in [2ms].
2021-11-19T00:00:45.788Z INFO  [CreateNewSingleIndexRangeJob] Created ranges for index gcm_usqa_560.
2021-11-19T00:00:45.800Z INFO  [SystemJobManager] SystemJob <acb903d0-48cb-11ec-ba22-0a491ee5d31d> [org.graylog2.indexer.indices.jobs.SetIndexReadOnlyAndCalculateRangeJob] finished in 299ms.
2021-11-19T00:00:46.148Z INFO  [SetIndexReadOnlyJob] Flushing old index <tech-dev_253>.
2021-11-19T00:00:46.265Z INFO  [SetIndexReadOnlyJob] Setting old index <tech-dev_253> to read-only.
2021-11-19T00:00:46.352Z INFO  [SystemJobManager] Submitted SystemJob <bf1c8100-48cb-11ec-ba22-0a491ee5d31d> [org.graylog2.indexer.indices.jobs.OptimizeIndexJob]
2021-11-19T00:00:46.372Z INFO  [CreateNewSingleIndexRangeJob] Calculating ranges for index tech-dev_253.
2021-11-19T00:00:46.372Z INFO  [OptimizeIndexJob] Optimizing index <tech-dev_253>.
2021-11-19T00:00:46.430Z INFO  [MongoIndexRangeService] Calculated range of [tech-dev_253] in [56ms].
2021-11-19T00:00:46.431Z INFO  [CreateNewSingleIndexRangeJob] Created ranges for index tech-dev_253.
2021-11-19T00:00:46.444Z INFO  [SystemJobManager] SystemJob <ad166610-48cb-11ec-ba22-0a491ee5d31d> [org.graylog2.indexer.indices.jobs.SetIndexReadOnlyAndCalculateRangeJob] finished in 330ms.

graylog-server worker node config

elasticsearch_hosts = http://elasticsearch.MYDOMAIN.com:9200
mongodb_uri = mongodb+srv://mongodb.MYDOMAIN.com/graylog?ssl=false
http_bind_address = 10.15.0.218:9000
is_master = false
node_id_file = /etc/graylog/server/node-id
password_secret = <REMOVED>
root_username = admin
root_password_sha2 = <REMOVED>
root_email = "<MASKED>"
versionchecks = false

bin_dir = /usr/share/graylog-server/bin
data_dir = /var/lib/graylog-server
plugin_dir = /usr/share/graylog-server/plugin

elasticsearch_discovery_enabled = true
rotation_strategy = time
elasticsearch_max_time_per_index = 1d
elasticsearch_max_number_of_indices = 20
retention_strategy = delete
elasticsearch_shards = 3
elasticsearch_replicas = 2
elasticsearch_index_prefix = graylog
allow_leading_wildcard_searches = false
allow_highlighting = false
elasticsearch_analyzer = standard
output_batch_size = 500
output_flush_interval = 1
output_fault_count_threshold = 5
output_fault_penalty_seconds = 30
processbuffer_processors = 5
outputbuffer_processors = 3
processor_wait_strategy = blocking
ring_size = 65536
inputbuffer_ring_size = 65536
inputbuffer_processors = 5
inputbuffer_wait_strategy = blocking

message_journal_enabled = true
message_journal_dir = /var/lib/graylog-server/journal
message_journal_max_age = 96h
message_journal_max_size = 100gb

lb_recognition_period_seconds = 30
mongodb_max_connections = 1000
mongodb_threads_allowed_to_block_multiplier = 5
proxied_requests_thread_pool_size = 32

Hello,

I might be able to help; it seems you have a large environment. The first thing I would look at is your configuration. If the configuration is correct, meaning there are no errors, then we can move on from there. I did find a couple of things that didn’t look right, but I might be wrong.

“Messages from 51 years ago” would be the year 1970, i.e. the Unix epoch. If all your nodes are Linux, I suggest installing ntp or a similar time-sync service on every node.
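
To make the “51 years” figure concrete: it is exactly the distance from 2021 back to the Unix epoch, which suggests messages are arriving with a timestamp of (nearly) zero. A quick sanity check:

```shell
# A Unix timestamp of 0 renders as 1 January 1970; a UI in 2021 would
# describe that as "51 years ago". GNU date syntax (BSD date uses -r 0).
date -u -d @0 +%Y-%m-%dT%H:%M:%SZ
# 1970-01-01T00:00:00Z
```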

I counted 13 nodes? Is this correct? Or do you have Graylog and MongoDB on the same nodes?

Thank you for sharing your configuration file. With a quick glance, I don’t see these settings configured in your server.conf:

elasticsearch_hosts = http://192.168.1.12:9200, http://192.168.1.13:9200, http://192.168.1.14:9200
mongodb_uri = mongodb://grayloguser:secret@localhost:27017,localhost:27018,localhost:27019/graylog?replicaSet=rs01

By chance, did this setup work before or did you just run into problems recently?
Have you checked your Elasticsearch cluster? If so, what does the status show using this command? You may need to replace localhost.

curl -XGET http://localhost:9200/_cluster/health?pretty=true

Have you checked your MongoDB status? You can log in using the mongo shell and typing rs.status();

[root@graylog elasticsearch]# mongo 

> use graylog;
switched to db graylog
> rs.status();
{
       "ok" : 0,
       "errmsg" : "not running with --replSet",
       "code" : 76,
       "codeName" : "NoReplicationEnabled"
}
>

Whatever you can show us would be great. Help us help you.


Thanks for your help.

I counted 13 nodes? Is this correct? Or do you have Graylog and MongoDB on the same nodes?

Yes, there’s a total of 13 nodes in our setup. The Graylog servers run on different nodes from the MongoDB cluster.

If all your nodes are Linux, I suggest installing ntp or a similar time-sync service on every node.

Checking that NTP is running is a great idea, and I can confirm that all 13 nodes are running ntp. System time is correct and well within the configured NTP drift tolerance.
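
For reference, this is roughly how I checked the sync state on each node. Hostnames are placeholders; timedatectl assumes systemd, and ntpq -p works where classic ntpd is installed.

```shell
# Sketch: check clock-sync state across nodes. Hostnames are placeholders.
for host in graylog-master graylog-worker-1 graylog-worker-2; do
  echo "== $host =="
  # "System clock synchronized: yes" means the node agrees with its NTP source.
  ssh "$host" 'timedatectl | grep -i synchronized'
  # With classic ntpd, "ntpq -p" lists peers; the offset column is in milliseconds.
done
```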

By chance, did this setup work before or did you just run into problems recently?

Good question - we’ve always had this problem, so it could be a config issue or a security group that isn’t open.
That said, logs are flowing and the Graylog web interface works fine.

With a quick glance, I don’t see these settings configured in your server.conf:

elasticsearch_hosts = http://elasticsearch.MYDOMAIN.com:9200
mongodb_uri = mongodb+srv://mongodb.MYDOMAIN.com/graylog?ssl=false

It’s good that you queried this, as I realise I left out some details.
Elasticsearch discovery is enabled, as per this setting:

elasticsearch_discovery_enabled = true

And Elasticsearch sits behind an AWS NLB; here is a dig of the DNS entry:

elasticsearch.<DOMAIN>.com. 300	IN	CNAME	elasticsearch-nlb-0xxxxxxxx3.elb.us-east-1.amazonaws.com.
elasticsearch-nlb-0xxxxxxxx3.elb.us-east-1.amazonaws.com.	60 IN A	10.37.YY.33
elasticsearch-nlb-0xxxxxxxx3.elb.us-east-1.amazonaws.com.	60 IN A	10.37.YZ.48
elasticsearch-nlb-0xxxxxxxx3.elb.us-east-1.amazonaws.com.	60 IN A	10.37.YX.109

And here is our MongoDB TXT record for our replica set:

mongodb.MYDOMAIN.com.	300	IN	TXT	"replicaSet=rs0&authSource=admin"
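
A note for anyone reading: the mongodb+srv:// scheme means the driver fetches the replica set member list from an SRV record named _mongodb._tcp.&lt;host&gt; and extra options from a TXT record on &lt;host&gt;, so both should resolve. A sketch of the check (domain redacted as in the rest of the post):

```shell
# Sketch: verify the DNS records behind a mongodb+srv:// URI.
# The driver prepends "_mongodb._tcp." to the hostname for the SRV lookup.
dig +short SRV _mongodb._tcp.mongodb.MYDOMAIN.com   # should list mongo0/1/2 on port 27017
dig +short TXT mongodb.MYDOMAIN.com                 # should show replicaSet/authSource options
```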

Have you checked your Elasticsearch cluster?
Yes, the cluster is green; here’s the output:

~# curl -XGET http://elasticsearch.<DOMAIN>.com:9200/_cluster/health?pretty=true
{
  "cluster_name" : "logging",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 7,
  "number_of_data_nodes" : 4,
  "active_primary_shards" : 760,
  "active_shards" : 2254,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

Have you checked your MongoDB status?

Yes, here’s the output:

mongodb rs.status() output
rs0:PRIMARY> rs.status()
{
	"set" : "rs0",
	"date" : ISODate("2021-11-26T00:27:24.278Z"),
	"myState" : 1,
	"term" : NumberLong(2),
	"syncingTo" : "",
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"heartbeatIntervalMillis" : NumberLong(2000),
	"optimes" : {
		"lastCommittedOpTime" : {
			"ts" : Timestamp(1637886444, 6),
			"t" : NumberLong(2)
		},
		"readConcernMajorityOpTime" : {
			"ts" : Timestamp(1637886444, 6),
			"t" : NumberLong(2)
		},
		"appliedOpTime" : {
			"ts" : Timestamp(1637886444, 7),
			"t" : NumberLong(2)
		},
		"durableOpTime" : {
			"ts" : Timestamp(1637886444, 6),
			"t" : NumberLong(2)
		}
	},
	"lastStableCheckpointTimestamp" : Timestamp(1637886387, 1),
	"electionCandidateMetrics" : {
		"lastElectionReason" : "electionTimeout",
		"lastElectionDate" : ISODate("2021-04-15T04:58:16.144Z"),
		"electionTerm" : NumberLong(2),
		"lastCommittedOpTimeAtElection" : {
			"ts" : Timestamp(1618462683, 1),
			"t" : NumberLong(1)
		},
		"lastSeenOpTimeAtElection" : {
			"ts" : Timestamp(1618462683, 1),
			"t" : NumberLong(1)
		},
		"numVotesNeeded" : 2,
		"priorityAtElection" : 1,
		"electionTimeoutMillis" : NumberLong(10000),
		"numCatchUpOps" : NumberLong(0),
		"newTermStartDate" : ISODate("2021-04-15T04:58:16.149Z"),
		"wMajorityWriteAvailabilityDate" : ISODate("2021-04-15T04:58:16.307Z")
	},
	"members" : [
		{
			"_id" : 1,
			"name" : "mongo2.<DOMAIN>:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 7505701,
			"optime" : {
				"ts" : Timestamp(1637886443, 16),
				"t" : NumberLong(2)
			},
			"optimeDurable" : {
				"ts" : Timestamp(1637886443, 16),
				"t" : NumberLong(2)
			},
			"optimeDate" : ISODate("2021-11-26T00:27:23Z"),
			"optimeDurableDate" : ISODate("2021-11-26T00:27:23Z"),
			"lastHeartbeat" : ISODate("2021-11-26T00:27:23.959Z"),
			"lastHeartbeatRecv" : ISODate("2021-11-26T00:27:23.388Z"),
			"pingMs" : NumberLong(0),
			"lastHeartbeatMessage" : "",
			"syncingTo" : "mongo0.<DOMAIN>:27017",
			"syncSourceHost" : "mongo0.<DOMAIN>:27017",
			"syncSourceId" : 3,
			"infoMessage" : "",
			"configVersion" : 6
		},
		{
			"_id" : 2,
			"name" : "mongo1.<DOMAIN>:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 67166,
			"optime" : {
				"ts" : Timestamp(1637886443, 6),
				"t" : NumberLong(2)
			},
			"optimeDurable" : {
				"ts" : Timestamp(1637886443, 6),
				"t" : NumberLong(2)
			},
			"optimeDate" : ISODate("2021-11-26T00:27:23Z"),
			"optimeDurableDate" : ISODate("2021-11-26T00:27:23Z"),
			"lastHeartbeat" : ISODate("2021-11-26T00:27:23.256Z"),
			"lastHeartbeatRecv" : ISODate("2021-11-26T00:27:23.652Z"),
			"pingMs" : NumberLong(1),
			"lastHeartbeatMessage" : "",
			"syncingTo" : "mongo2.<DOMAIN>:27017",
			"syncSourceHost" : "mongo2.<DOMAIN>:27017",
			"syncSourceId" : 1,
			"infoMessage" : "",
			"configVersion" : 6
		},
		{
			"_id" : 3,
			"name" : "mongo0.<DOMAIN>:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 19423941,
			"optime" : {
				"ts" : Timestamp(1637886444, 7),
				"t" : NumberLong(2)
			},
			"optimeDate" : ISODate("2021-11-26T00:27:24Z"),
			"syncingTo" : "",
			"syncSourceHost" : "",
			"syncSourceId" : -1,
			"infoMessage" : "",
			"electionTime" : Timestamp(1618462696, 1),
			"electionDate" : ISODate("2021-04-15T04:58:16Z"),
			"configVersion" : 6,
			"self" : true,
			"lastHeartbeatMessage" : ""
		}
	],
	"ok" : 1,
	"operationTime" : Timestamp(1637886444, 7),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1637886444, 7),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}
rs0:PRIMARY>

More config files

elasticsearch master node config
# ---------------------------------- Cluster -----------------------------------
# Use a descriptive name for your cluster:
cluster.name: logging
# ------------------------------------ Node ------------------------------------
node.name: es_master_${HOSTNAME}
node.master: true
node.data: false
node.ingest: true
node.ml: false
xpack.ml.enabled: true
cluster.remote.connect: false
# ----------------------------------- Paths ------------------------------------
# Path to directory where to store the data (separate multiple locations by comma):
path.data: /var/lib/elasticsearch
# Path to log files:
path.logs: /var/log/elasticsearch
# ---------------------------------- Network -----------------------------------
network.host: _ec2_
# --------------------------------- Discovery ----------------------------------
discovery.zen.ping.unicast.hosts: []

discovery.zen.hosts_provider: ec2
discovery.ec2.any_group: true
discovery.ec2.host_type: private_ip
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
cluster.routing.allocation.disk.watermark.low: "90%"
cluster.routing.allocation.disk.watermark.high: "95%"
discovery.ec2.tag.es_cluster: logging
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
logger.org.elasticsearch.discovery: TRACE
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
discovery.zen.minimum_master_nodes: 1

elasticsearch data node config
# ---------------------------------- Cluster -----------------------------------
# Use a descriptive name for your cluster:
cluster.name: logging
# ------------------------------------ Node ------------------------------------
# Use a descriptive name for the node:
node.name: es_data_${HOSTNAME}
node.master: false
node.data: true
node.ingest: false
node.ml: false
cluster.remote.connect: false
# ----------------------------------- Paths ------------------------------------
# Path to directory where to store the data (separate multiple locations by comma):
path.data: /var/lib/elasticsearch
# Path to log files:
path.logs: /var/log/elasticsearch
# ---------------------------------- Network -----------------------------------
network.host: _ec2_
discovery.zen.ping.unicast.hosts: []
discovery.zen.hosts_provider: ec2
discovery.ec2.any_group: true
discovery.ec2.host_type: private_ip
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
cluster.routing.allocation.disk.watermark.low: "90%"
cluster.routing.allocation.disk.watermark.high: "95%"
discovery.ec2.tag.es_cluster: logging
discovery.ec2.endpoint: ec2.us-east-1.amazonaws.com
logger.org.elasticsearch.discovery: TRACE

# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
discovery.zen.minimum_master_nodes: 1

Hello,

Thanks for the added details. You do have a large setup, for sure.

So the main topics are the daily manual index recalculation and the index ranges showing “messages from 51 years ago”.

Graylog only shows what it can get from Elasticsearch. I’m not 100% sure, but if you have to manually rotate or recalculate your indices, then I think there is a communication breakdown somewhere. That being said, I would look into your Elasticsearch configuration, and then, if needed, adjust your Graylog configuration file as shown above. I think if you fix the index issue, the time issue might be solved as well.

I see you’re using AWS. Some of those settings in your ES setup I have seen before, but not with Graylog. Maybe it’s something new.

As for Graylog, this is coming from the documentation for a multi-node cluster, as shown below.

The Elasticsearch servers need one IP that can be reached over network set in network.host and some participants of the cluster in discovery.zen.ping.unicast.hosts. That is enough to have a minimal cluster setup.

I’m not 100% sure you need these configurations in your ES config, but in my other lab setups (OpenSearch, Open Distro for Elasticsearch, and ELK) I did have to configure them, as shown below.

Example:

cluster.name: graylog
network.host: <host IP-Add>
http.port: 9200
node.name: lab-elastic-001
node.master: true
node.data: false
discovery.zen.ping.unicast.hosts: ["lab-elastic-002","lab-elastic-003","lab-elastic-004","lab-elastic-005"]
discovery.zen.minimum_master_nodes: 2  

As for MongoDB, it should look something like this in the Graylog configuration file:

mongodb_uri = mongodb://USERNAME:PASSWORD@mongodb-node01:27017,mongodb-node02:27017,mongodb-node03:27017/graylog?replicaSet=rs01

I would imagine that if your primary MongoDB server crashes, how does Graylog know to use the other two if you only have one server configured? Also, you’re not using port 27017 in the MongoDB URI?

I did a mock-up, from the information you gave above, of what I would have done:

mongodb_uri = mongodb://mongo2.<DOMAIN>:27017,mongo1.<DOMAIN>:27017,mongo0.<DOMAIN>:27017/graylog?replicaSet=rs0

The reason I ask these questions about your configuration is that, the way you have it set up to communicate with the other nodes, it doesn’t seem fault-tolerant unless you have other hardware in front of your cluster, or AWS is using dark magic.

Having DNS entries is good; that rules out some troubleshooting steps. But the question is: have you checked whether your Elasticsearch nodes can communicate with each other? On your ES master node, this command should show all your nodes. This goes along with your Graylog configuration file.

curl -XGET http://localhost:9200/_nodes?pretty

If that looks good, my next question would be:

If you only have one Elasticsearch server configured in Graylog’s server.conf, then how does Graylog know to use your other Elasticsearch servers if that one goes down? Yes, I understand that discovery is set to true, and I quote:

The most important setting to make a successful connection is a list of comma-separated URIs to one or more Elasticsearch nodes. To avoid issues with the connection to the Elasticsearch cluster you should add some of the network addresses of the Elasticsearch nodes to elasticsearch_hosts.

Looking at MongoDB, I really don’t see an issue. MongoDB basically just holds all the metadata for Graylog.

Have you done a controlled test? Meaning, shut down the Elasticsearch server at

http://elasticsearch.MYDOMAIN.com:9200

and your MongoDB server at

mongodb_uri = mongodb+srv://mongodb.MYDOMAIN.com/graylog?ssl=false

Then, does your cluster setup still work, and does the web UI function correctly?
I believe the reason we build a cluster is so that it’s fault-tolerant.
If that works with no errors or warnings in the log files, and the web UI responds without any issues, perhaps we can look elsewhere for this issue.
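
A sketch of that controlled test, with placeholder node names. Since the Graylog config above sets elasticsearch_replicas = 2, the cluster should drop to yellow, not red, when a single data node stops:

```shell
# Sketch of a controlled failover test. Node names are placeholders.
ssh es-data-1 'sudo systemctl stop elasticsearch'

# Cluster should report "yellow" (replicas still serving), not "red" (data lost):
curl -s "http://elasticsearch.MYDOMAIN.com:9200/_cluster/health?pretty" | grep '"status"'

# Graylog should keep ingesting and searching; check the web UI, then restore:
ssh es-data-1 'sudo systemctl start elasticsearch'
```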

Here is the documentation for the statements above:

https://docs.graylog.org/v1/docs/multinode-setup

https://docs.graylog.org/v1/docs/elasticsearch

Sorry I can’t give a direct answer to your question; you have configurations that seem to be out of place.
So my questions focus on the configuration you have, to understand how your environment works.


Graylog only shows what it can get from Elasticsearch.

OK, this is good to know. That points to it being an Elasticsearch config issue.

The Elasticsearch servers need one IP that can be reached over network set in network.host and some participants of the cluster in discovery.zen.ping.unicast.hosts. That is enough to have a minimal cluster setup.

Good find; it looks like my Elasticsearch setting of:

network.host: _ec2_

might be incorrect. Let me follow this up and I’ll get back to you.

I will also get back to you with the output of those Elasticsearch commands.

I did subsequently notice that sometimes an extra index appears to be created at rotation.

But I’m currently chasing up whether the hosts sending these logs are NTP-synced.
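
One way to confirm that theory is to query Elasticsearch directly for documents with epoch-era timestamps. This is a sketch: it assumes Graylog’s default "timestamp" field and its "yyyy-MM-dd HH:mm:ss.SSS" date format, and the index pattern needs adjusting per index set prefix (gcm_usqa_*, tech-dev_*, graylog_*):

```shell
# Sketch: find documents whose timestamp landed near the Unix epoch.
# Adjust the index pattern to the index set being checked.
curl -s -XGET "http://elasticsearch.MYDOMAIN.com:9200/graylog_*/_search?pretty" \
  -H 'Content-Type: application/json' -d '{
  "size": 5,
  "sort": [{ "timestamp": "asc" }],
  "query": { "range": { "timestamp": { "lt": "1971-01-01 00:00:00.000" } } }
}'
```

If this returns hits, the source or gl2_source_input fields of those documents should identify which sender’s clock is off.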