Upgraded to Graylog OS 4.1.2 from 4.1.1 - it's busted (Elasticsearch, I think)

Alright, so I had a working Graylog 4.1.1 install on Ubuntu 20.04.
I am super new to Graylog/ES/MongoDB and I’m decent with Linux, but not amazing.
I upgraded to 4.1.2 and now I have some issues. The actual upgrade via APT went smoothly and without error.

I can log into Graylog, but when I get on the search page I see this:

While retrieving data for this widget, the following error(s) occurred:
Request cannot be executed; I/O reactor status: STOPPED.

When checking the server.log I see this repeated:

2021-07-30T00:14:07.534Z ERROR [IndexRangesCleanupPeriodical] Uncaught exception in periodical
org.graylog.shaded.elasticsearch7.org.elasticsearch.ElasticsearchException: An error occurred: 
	at org.graylog.storage.elasticsearch7.ElasticsearchClient.exceptionFrom(ElasticsearchClient.java:136) ~[?:?]
	at org.graylog.storage.elasticsearch7.ElasticsearchClient.execute(ElasticsearchClient.java:99) ~[?:?]
	at org.graylog.storage.elasticsearch7.ElasticsearchClient.execute(ElasticsearchClient.java:92) ~[?:?]
	at org.graylog.storage.elasticsearch7.ClusterAdapterES7.isConnected(ClusterAdapterES7.java:162) ~[?:?]
	at org.graylog2.indexer.cluster.Cluster.isConnected(Cluster.java:115) ~[graylog.jar:?]
	at org.graylog2.periodical.IndexRangesCleanupPeriodical.doRun(IndexRangesCleanupPeriodical.java:70) ~[graylog.jar:?]
	at org.graylog2.plugin.periodical.Periodical.run(Periodical.java:77) [graylog.jar:?]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_292]
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [?:1.8.0_292]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_292]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_292]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_292]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_292]
Caused by: java.lang.RuntimeException: Request cannot be executed; I/O reactor status: STOPPED
	at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestClient.extractAndWrapCause(RestClient.java:857) ~[?:?]
	at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestClient.performRequest(RestClient.java:259) ~[?:?]
	at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestClient.performRequest(RestClient.java:246) ~[?:?]
	at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.internalPerformRequest(RestHighLevelClient.java:1613) ~[?:?]
	at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:1583) ~[?:?]
	at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:1553) ~[?:?]
	at org.graylog.shaded.elasticsearch7.org.elasticsearch.client.ClusterClient.health(ClusterClient.java:130) ~[?:?]
	at org.graylog.storage.elasticsearch7.ClusterAdapterES7.lambda$isConnected$9(ClusterAdapterES7.java:162) ~[?:?]
	at org.graylog.storage.elasticsearch7.ElasticsearchClient.execute(ElasticsearchClient.java:97) ~[?:?]
	... 12 more

Node status says this:

I’m honestly not 100% sure where else to look.

What are you seeing in the Elasticsearch logs? This looks like Elasticsearch is down, or unable to ingest logs. My cluster is running 4.1.2 and doesn’t exhibit the same behavior.

Sooo…super new to elasticsearch, so I may not be looking in the right area.

If I dig into /var/log/elasticsearch/graylog.log, there’s nothing there.
If I go up one level, I don’t see any logs there either. I’ll dig more on Monday.

Hmmmm if there’s nothing in /var/log/elasticsearch/graylog.log, then that’s highly suspicious to me and leads me to think that it’s not running. How about a systemctl status elasticsearch?
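
If it is running, it’s also worth poking Elasticsearch directly over its REST API to confirm it’s answering HTTP. A quick sanity check, assuming the default bind of localhost:9200:

curl -XGET 'localhost:9200/_cluster/health?pretty'

If that gets refused or hangs while the service shows as running, that would line up with the “I/O reactor status: STOPPED” error Graylog is throwing.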

● elasticsearch.service - Elasticsearch
     Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2021-07-29 19:13:47 UTC; 3 days ago
       Docs: https://www.elastic.co
   Main PID: 877 (java)
      Tasks: 50 (limit: 19110)
     Memory: 1.5G
     CGroup: /system.slice/elasticsearch.service
             └─877 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.encoding=UTF-8 -Djna.nosys=true -XX:-OmitStackTraceInFastThrow -XX>

Jul 29 19:13:28 graylog systemd[1]: Starting Elasticsearch...
Jul 29 19:13:47 graylog systemd[1]: Started Elasticsearch.

It’s running. I’m not sure why there’s no log.

:man_facepalming:
It just wasn’t showing up when I did an ls -a, but when I did a sudo cat /var/log/elasticsearch/graylog.log, sure enough it spit out text.
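
For anyone else tripping over this: it’s most likely just directory permissions. On the Debian/Ubuntu package, /var/log/elasticsearch is typically owned by the elasticsearch user and not world-readable, so a regular user can’t see its contents, which is why ls came back empty while sudo cat worked. Listing it with sudo should show the files:

sudo ls -la /var/log/elasticsearch/
sudo tail -n 50 /var/log/elasticsearch/graylog.log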

Sooo…I restarted Elasticsearch and this is what it says in the log (tail of the log, anyway):

psdadmin@graylog:~$ sudo tail /var/log/elasticsearch/graylog.log -c 40000
[2021-08-02T15:16:54,843][INFO ][o.e.n.Node               ] [graylog] stopping ...
[2021-08-02T15:16:54,896][INFO ][o.e.n.Node               ] [graylog] stopped
[2021-08-02T15:16:54,897][INFO ][o.e.n.Node               ] [graylog] closing ...
[2021-08-02T15:16:54,903][INFO ][o.e.n.Node               ] [graylog] closed
[2021-08-02T15:16:57,636][INFO ][o.e.n.Node               ] [graylog] version[7.10.2], pid[141153], build[oss/deb/747e1cc71def077253878a59143c1f785afa92b9/2021-01-13T00:42:12.435326Z], OS[Linux/5.4.0-80-generic/amd64], JVM[AdoptOpenJDK/OpenJDK 64-Bit Server VM/15.0.1/15.0.1+9]
[2021-08-02T15:16:57,638][INFO ][o.e.n.Node               ] [graylog] JVM home [/usr/share/elasticsearch/jdk], using bundled JDK [true]
[2021-08-02T15:16:57,638][INFO ][o.e.n.Node               ] [graylog] JVM arguments [-Xshare:auto, -Des.networkaddress.cache.ttl=60, -Des.networkaddress.cache.negative.ttl=10, -XX:+AlwaysPreTouch, -Xss1m, -Djava.awt.headless=true, -Dfile.encoding=UTF-8, -Djna.nosys=true, -XX:-OmitStackTraceInFastThrow, -XX:+ShowCodeDetailsInExceptionMessages, -Dio.netty.noUnsafe=true, -Dio.netty.noKeySetOptimization=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Dio.netty.allocator.numDirectArenas=0, -Dlog4j.shutdownHookEnabled=false, -Dlog4j2.disable.jmx=true, -Djava.locale.providers=SPI,COMPAT, -Xms1g, -Xmx1g, -XX:+UseG1GC, -XX:G1ReservePercent=25, -XX:InitiatingHeapOccupancyPercent=30, -Djava.io.tmpdir=/tmp/elasticsearch-15773097345297971763, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/elasticsearch, -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log, -Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m, -XX:MaxDirectMemorySize=536870912, -Des.path.home=/usr/share/elasticsearch, -Des.path.conf=/etc/elasticsearch, -Des.distribution.flavor=oss, -Des.distribution.type=deb, -Des.bundled_jdk=true]
[2021-08-02T15:16:58,297][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [aggs-matrix-stats]
[2021-08-02T15:16:58,297][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [analysis-common]
[2021-08-02T15:16:58,297][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [geo]
[2021-08-02T15:16:58,297][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [ingest-common]
[2021-08-02T15:16:58,297][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [ingest-geoip]
[2021-08-02T15:16:58,298][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [ingest-user-agent]
[2021-08-02T15:16:58,298][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [kibana]
[2021-08-02T15:16:58,298][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [lang-expression]
[2021-08-02T15:16:58,298][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [lang-mustache]
[2021-08-02T15:16:58,298][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [lang-painless]
[2021-08-02T15:16:58,298][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [mapper-extras]
[2021-08-02T15:16:58,298][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [parent-join]
[2021-08-02T15:16:58,298][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [percolator]
[2021-08-02T15:16:58,299][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [rank-eval]
[2021-08-02T15:16:58,299][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [reindex]
[2021-08-02T15:16:58,299][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [repository-url]
[2021-08-02T15:16:58,299][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [systemd]
[2021-08-02T15:16:58,299][INFO ][o.e.p.PluginsService     ] [graylog] loaded module [transport-netty4]
[2021-08-02T15:16:58,299][INFO ][o.e.p.PluginsService     ] [graylog] no plugins loaded
[2021-08-02T15:16:58,324][INFO ][o.e.e.NodeEnvironment    ] [graylog] using [1] data paths, mounts [[/ (/dev/mapper/ubuntu--vg-ubuntu--lv)]], net usable_space [170.4gb], net total_space [194.8gb], types [ext4]
[2021-08-02T15:16:58,324][INFO ][o.e.e.NodeEnvironment    ] [graylog] heap size [1gb], compressed ordinary object pointers [true]
[2021-08-02T15:16:58,402][INFO ][o.e.n.Node               ] [graylog] node name [graylog], node ID [v7mjcc1SQ4q4YCv1NCBxBQ], cluster name [graylog], roles [master, remote_cluster_client, data, ingest]
[2021-08-02T15:17:00,715][INFO ][o.e.t.NettyAllocator     ] [graylog] creating NettyAllocator with the following configs: [name=unpooled, suggested_max_allocation_size=256kb, factors={es.unsafe.use_unpooled_allocator=null, g1gc_enabled=true, g1gc_region_size=1mb, heap_size=1gb}]
[2021-08-02T15:17:00,761][INFO ][o.e.d.DiscoveryModule    ] [graylog] using discovery type [zen] and seed hosts providers [settings]
[2021-08-02T15:17:00,962][WARN ][o.e.g.DanglingIndicesState] [graylog] gateway.auto_import_dangling_indices is disabled, dangling indices will not be automatically detected or imported and must be managed manually
[2021-08-02T15:17:01,096][INFO ][o.e.n.Node               ] [graylog] initialized
[2021-08-02T15:17:01,097][INFO ][o.e.n.Node               ] [graylog] starting ...
[2021-08-02T15:17:01,186][INFO ][o.e.t.TransportService   ] [graylog] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2021-08-02T15:17:01,413][WARN ][o.e.b.BootstrapChecks    ] [graylog] the default discovery settings are unsuitable for production use; at least one of [discovery.seed_hosts, discovery.seed_providers, cluster.initial_master_nodes] must be configured
[2021-08-02T15:17:01,415][INFO ][o.e.c.c.Coordinator      ] [graylog] cluster UUID [kiupN748Qg-WIaaUGCeZ8g]
[2021-08-02T15:17:01,423][INFO ][o.e.c.c.ClusterBootstrapService] [graylog] no discovery configuration found, will perform best-effort cluster bootstrapping after [3s] unless existing master is discovered
[2021-08-02T15:17:01,493][INFO ][o.e.c.s.MasterService    ] [graylog] elected-as-master ([1] nodes joined)[{graylog}{v7mjcc1SQ4q4YCv1NCBxBQ}{5ljDAR1uQA2UQLE2CPLMRw}{127.0.0.1}{127.0.0.1:9300}{dimr} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 23, version: 772, delta: master node changed {previous [], current [{graylog}{v7mjcc1SQ4q4YCv1NCBxBQ}{5ljDAR1uQA2UQLE2CPLMRw}{127.0.0.1}{127.0.0.1:9300}{dimr}]}
[2021-08-02T15:17:01,549][INFO ][o.e.c.s.ClusterApplierService] [graylog] master node changed {previous [], current [{graylog}{v7mjcc1SQ4q4YCv1NCBxBQ}{5ljDAR1uQA2UQLE2CPLMRw}{127.0.0.1}{127.0.0.1:9300}{dimr}]}, term: 23, version: 772, reason: Publication{term=23, version=772}
[2021-08-02T15:17:01,563][INFO ][o.e.h.AbstractHttpServerTransport] [graylog] publish_address {127.0.0.1:9200}, bound_addresses {[::1]:9200}, {127.0.0.1:9200}
[2021-08-02T15:17:01,563][INFO ][o.e.n.Node               ] [graylog] started
[2021-08-02T15:17:01,751][INFO ][o.e.g.GatewayService     ] [graylog] recovered [13] indices into cluster_state
[2021-08-02T15:17:03,866][INFO ][o.e.c.r.a.AllocationService] [graylog] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[graylog_0][0]]]).

Looks like things are working just fine. I’d be more curious about why the cluster went from red to green, but you can check that with the cluster allocation explain API:

curl -XGET localhost:9200/_cluster/allocation/explain?pretty

But I also noticed that ES is only getting 1GB of heap. How much data are you ingesting, and have you tried increasing the heap to see if that makes a difference?
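
If you want to see how full that heap actually is before bumping it, the _cat/nodes endpoint will show per-node heap and RAM usage (standard _cat columns, same localhost:9200 assumption as before):

curl -XGET 'localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max,ram.percent,ram.max'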

Thanks - I ran that and it returned this:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unable to find any unassigned shards to explain [ClusterAllocationExplainRequest[useAnyUnassignedShard=true,includeYesDecisions?=false]"
  },
  "status" : 400
}

I still see this when going to System/Overview

I’ll try increasing heap size. I don’t think we’re ingesting a ton. There are 4 servers monitored right now.

So how do you have Graylog and Elasticsearch deployed? Are they on the same box? What’s the RAM available to the OS? What’s the RAM set for GL/ES respectively?

Same VM per the install guide.
RAM available to the OS is 16GB (starting low for the time being, will ramp up as we get it where we want it)

As for what the RAM allowed to GL and ES respectively, I’ll need to find out where that’s defined.

After increasing the heap size, Elasticsearch appears to be working again. I’ll monitor it and make sure.

Gotcha. So if you’re on a 16GB node, you can safely give both GL & ES 5-6GB each and chug right on along. Graylog reads its heap configuration from:

  • /etc/default/graylog-server (Ubuntu)
  • /etc/sysconfig/graylog-server (RHEL)

And Elasticsearch uses:

  • /etc/default/elasticsearch (Ubuntu)
  • /etc/sysconfig/elasticsearch (RHEL) OR
  • /etc/elasticsearch/jvm.options

I’d stick with the default/sysconfig locations for consistency.
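
For reference, both end up as plain JVM flags. Roughly what the relevant lines look like on Ubuntu - the 5g values below are just illustrative for a 16GB box, and any other flags the packages ship with should be left in place:

# /etc/default/graylog-server
GRAYLOG_SERVER_JAVA_OPTS="-Xms5g -Xmx5g"

# /etc/default/elasticsearch (ES_JAVA_OPTS should take precedence over /etc/elasticsearch/jvm.options)
ES_JAVA_OPTS="-Xms5g -Xmx5g"

Keep -Xms and -Xmx equal, and restart graylog-server / elasticsearch afterwards so the new heap takes effect.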


Thanks for the info!

Here’s something odd - I have no events since I made the changes.

Given that one of the monitored servers generates a LOT of log traffic (relative to the others, anyway), I would expect things to show up past 10am.

What do your inputs look like? Are they running just fine?
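
If you’d rather check from the shell than the UI, the Graylog REST API exposes the input states, and Elasticsearch can confirm whether the indices are still growing. The credentials and index prefix below are assumptions based on a default install, so adjust them to your setup:

curl -s -u admin:yourpassword 'http://localhost:9000/api/system/inputstates?pretty=true'
curl -XGET 'localhost:9200/_cat/indices/graylog_*?v&h=index,docs.count,store.size'

If docs.count stops moving while the inputs report RUNNING, the problem is likely somewhere between the journal and Elasticsearch rather than at the inputs.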

Looks to be, yeah.

Chiming in,

I noticed you’re using TLS but you don’t have it enabled?

@gsmith @aaronsachs
Ok - It really was just choking on what was being ingested.
I increased the heap size on everything to something like 5GB and added more RAM to the VM.

Looks like those TLS values were defaults, because it only works if TLS is disabled there.
That being said, my next step is to work on getting TLS set up and enabled.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.