Custom Beat agent in collector sidecar gets into a hung state

Hi All,

We have built a custom Beat agent (Azurebeat) that connects to our Azure environment and pulls Azure metrics and Azure alerts from the respective subscriptions. Azurebeat is included in the Graylog collector sidecar, and we have created log collectors and a configuration for it in Graylog. Below is the config used to pull the data:

# Needed for Graylog
fields_under_root: true
fields.collector_node_id: ${sidecar.nodeId}
fields.collector_node_name: ${sidecar.nodeName}
fields.gl2_source_collector: ${sidecar.nodeId}

#==========================  Modules configuration ============================

azurebeat.modules:

#----------------------------- azurealerts Module ----------------------------
- module: azurealerts
  metricsets: ["azurealerts"]
  enabled: true
  period: 5m
  authUrl: 'https://login.microsoftonline.com/demolive.onmicrosoft.com/oauth2/token'
  clientId: '3423-ad434-adpe23-9pq323m'
  clientSecret: 'pTe234lspwen2357lQrsp09='
  subscriptionId: 'f2sdf-0282-d4j5-2kp7-9er4k5d4k'
  scopes: 
    - f2sdf-0282-d4j5-2kp7-9er4k5d4k
    
#---------------------------- azuremetrics Module ----------------------------
- module: azuremetrics
  metricsets: ["azuremetrics"]
  enabled: true
  period: 5m
  aggregationWindow: 5m
  authUrl: 'https://login.microsoftonline.com/demolive.onmicrosoft.com/oauth2/token'
  clientId: '3423-ad434-adpe23-9pq323m'
  clientSecret: 'pTe234lspwen2357lQrsp09='
  subscriptionId: 'f2sdf-0282-d4j5-2kp7-9er4k5d4k'
  monitors:
    - namespace: "Microsoft.ClassicCompute/virtualMachines"
      resource:
        - rcdemo/rcdemo
      dataPoints:
        - Percentage CPU
    - namespace: "Microsoft.Compute/virtualMachines"
      resource:
        - ElasticDev/elastic
      dataPoints:
        - Percentage CPU

#======================== Logstash output settings ===========================
output.logstash:
  hosts: ["graylog3-svc:5044"]
path:
  data: /var/lib/graylog-sidecar/collectors/azurebeat/data-dev
  logs: /var/lib/graylog-sidecar/collectors/azurebeat/log-dev

Data was coming in continuously, but after some time the collector (azurebeat) got into a hung state and we stopped receiving any data from it. To get data flowing again I have to restart the collector in the Graylog UI. We didn't see any errors in the Graylog or collector sidecar logs, and the issue does not happen frequently.

Please let me know your thoughts on how we can resolve this issue.

Thanks,
Ganeshbabu R

@Ganeshbabu
Hello,
Thanks for the information you provided. Before I can help you troubleshoot this issue, I'd like to ask you some questions.

  1. What version of Graylog do you have?
  2. What version of Elasticsearch do you have?
  3. What version of MongoDB do you have?
  4. What do the Graylog, Elasticsearch, and MongoDB logs look like when this happens?
  5. Is this the free version of Graylog or the Enterprise version?
  6. What kind of resources do you have for your Graylog server (i.e., CPU, RAM, and storage)?
  7. Are you running a virtual machine or is this physical hardware?
  8. What operating system are you using?
  9. What kind of installation do you have? (OVA, package, Docker, etc.)

Thanks

Sure @gsmith

Please find my responses below:

Graylog version is 3.3.2
Elasticsearch version is 6.8.1
MongoDB version is v3.6.23
We are using the open-source version of Graylog.
The K8s cluster size is 8 CPU cores, 32 GB RAM.
Below are the resources we have allocated for the Graylog server:

Limits:
  cpu:     1
  memory:  2Gi
Requests:
  cpu:      500m
  memory:   1536Mi

For Graylog storage I have allocated a 10Gi volume.

I am running Graylog, Elasticsearch, and MongoDB in an Azure Kubernetes Service (AKS) cluster.
I am using a Linux-based OS and I am running Graylog using Helm charts.

(charts/charts/graylog at main · KongZ/charts · GitHub)

We have built a Docker image for the Graylog collector sidecar, and the collector (azurebeat) is part of that image. The image is deployed as a pod in the K8s cluster and talks to Graylog. I mapped the above configuration to the collector to start pulling data from the Azure environments.

Below are the Graylog logs from the day the issue occurred:

2021-04-03 13:48:00,155 WARN    [GelfCodec] - GELF message <3425ef20-9483-11eb-b3f5-4a214cf9ce5b> (received from <10.244.3.253:36734>) is missing mandatory "host" field. - {}
2021-04-03 14:02:12,626 ERROR   [AnyExceptionClassMapper] - Unhandled exception in REST resource - {}
com.mongodb.MongoCommandException: Command failed with error 66 (ImmutableField): 'After applying the update, the (immutable) field '_id' was found to have been altered to _id: ObjectId('606875642ee20b6ec9e219cb')' on server igedge-mongodb-edge-headless:27017. The full response is {"operationTime": {"$timestamp": {"t": 1617458532, "i": 28}}, "ok": 0.0, "errmsg": "After applying the update, the (immutable) field '_id' was found to have been altered to _id: ObjectId('606875642ee20b6ec9e219cb')", "code": 66, "codeName": "ImmutableField", "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1617458532, "i": 28}}, "signature": {"hash": {"$binary": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "$type": "00"}, "keyId": {"$numberLong": "0"}}}}
        at com.mongodb.internal.connection.ProtocolHelper.getCommandFailureException(ProtocolHelper.java:175) ~[graylog.jar:?]
        at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:303) ~[graylog.jar:?]
        at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:259) ~[graylog.jar:?]
        at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99) ~[graylog.jar:?]
        at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:450) ~[graylog.jar:?]
        at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72) ~[graylog.jar:?]
        at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:226) ~[graylog.jar:?]
        at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269) ~[graylog.jar:?]
        at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131) ~[graylog.jar:?]
        at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:123) ~[graylog.jar:?]
        at com.mongodb.operation.CommandOperationHelper$13.call(CommandOperationHelper.java:710) ~[graylog.jar:?]
        at com.mongodb.operation.OperationHelper.withReleasableConnection(OperationHelper.java:501) ~[graylog.jar:?]
        at com.mongodb.operation.CommandOperationHelper.executeRetryableCommand(CommandOperationHelper.java:703) ~[graylog.jar:?]
        at com.mongodb.operation.CommandOperationHelper.executeRetryableCommand(CommandOperationHelper.java:695) ~[graylog.jar:?]
        at com.mongodb.operation.BaseFindAndModifyOperation.execute(BaseFindAndModifyOperation.java:70) ~[graylog.jar:?]
        at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:213) ~[graylog.jar:?]
        at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:182) ~[graylog.jar:?]
        at com.mongodb.DBCollection.findAndModify(DBCollection.java:2036) ~[graylog.jar:?]
        at com.mongodb.DBCollection.findAndModify(DBCollection.java:1891) ~[graylog.jar:?]
        at com.mongodb.DBCollection.findAndModify(DBCollection.java:1858) ~[graylog.jar:?]
        at com.mongodb.DBCollection.findAndModify(DBCollection.java:1801) ~[graylog.jar:?]
        at org.mongojack.JacksonDBCollection.findAndModify(JacksonDBCollection.java:881) ~[graylog.jar:?]
        at org.graylog.plugins.sidecar.services.ActionService.saveAction(ActionService.java:73) ~[graylog.jar:?]
        at org.graylog.plugins.sidecar.rest.resources.AdministrationResource.setAction(AdministrationResource.java:166) ~[graylog.jar:?]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
        at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
        at org.glassfish.jersey.server.model.internal.ResourceMethodInvocationHandlerFactory$1.invoke(ResourceMethodInvocationHandlerFactory.java:81) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher$1.run(AbstractJavaResourceMethodDispatcher.java:144) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.invoke(AbstractJavaResourceMethodDispatcher.java:161) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.JavaResourceMethodDispatcherProvider$ResponseOutInvoker.doDispatch(JavaResourceMethodDispatcherProvider.java:160) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.internal.AbstractJavaResourceMethodDispatcher.dispatch(AbstractJavaResourceMethodDispatcher.java:99) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.invoke(ResourceMethodInvoker.java:389) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:347) ~[graylog.jar:?]
        at org.glassfish.jersey.server.model.ResourceMethodInvoker.apply(ResourceMethodInvoker.java:102) ~[graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime$2.run(ServerRuntime.java:326) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:271) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors$1.call(Errors.java:267) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:315) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:297) [graylog.jar:?]
        at org.glassfish.jersey.internal.Errors.process(Errors.java:267) [graylog.jar:?]
        at org.glassfish.jersey.process.internal.RequestScope.runInScope(RequestScope.java:317) [graylog.jar:?]
        at org.glassfish.jersey.server.ServerRuntime.process(ServerRuntime.java:305) [graylog.jar:?]
        at org.glassfish.jersey.server.ApplicationHandler.handle(ApplicationHandler.java:1154) [graylog.jar:?]
        at org.glassfish.jersey.grizzly2.httpserver.GrizzlyHttpContainer.service(GrizzlyHttpContainer.java:384) [graylog.jar:?]
        at org.glassfish.grizzly.http.server.HttpHandler$1.run(HttpHandler.java:224) [graylog.jar:?]
        at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181) [graylog.jar:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_252]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_252]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
2021-04-03 14:18:22,068 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 15:18:22,203 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 15:47:00,097 WARN    [GelfCodec] - GELF message <d3e3ff10-9493-11eb-b3f5-4a214cf9ce5b> (received from <10.244.3.253:33092>) is missing mandatory "host" field. - {}
2021-04-03 16:04:00,111 WARN    [GelfCodec] - GELF message <33dd2bb0-9496-11eb-b3f5-4a214cf9ce5b> (received from <10.244.3.253:44646>) is missing mandatory "host" field. - {}
2021-04-03 16:18:22,143 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 16:45:00,103 WARN    [GelfCodec] - GELF message <ee20a290-949b-11eb-b3f5-4a214cf9ce5b> (received from <10.244.3.253:44382>) is missing mandatory "host" field. - {}
2021-04-03 17:18:22,196 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 18:18:22,766 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 19:18:23,472 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 20:18:22,501 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 21:18:22,342 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 22:12:47,208 WARN    [GelfCodec] - GELF message <b89d8380-94c9-11eb-b3f5-4a214cf9ce5b> (received from <10.244.4.41:50704>) has invalid "timestamp": 2021-04-03 22:00:00+00:00  (type: STRING) - {}
2021-04-03 22:18:21,872 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}
2021-04-03 23:18:22,538 INFO    [IndexRangesCleanupPeriodical] - Removing index range information for unavailable indices: [test_discovery_nodes_0] - {}

For the Elasticsearch and MongoDB logs I didn't get any error messages; these are the kinds of messages I saw that day.

MongoDB logs:

2021-04-06T05:13:31.042+0000 I NETWORK  [listener] connection accepted from 10.244.4.33:58808 #578545 (63 connections now open)
2021-04-06T05:13:31.042+0000 I NETWORK  [conn578545] received client metadata from 10.244.4.33:58808 conn578545: { driver: { name: "PyMongo", version: "3.11.3" }, os: { type: "Linux", name: "Linux", architecture: "x86_64", version: "4.15.0-1108-azure" }, platform: "CPython 3.6.9.final.0" }
2021-04-06T05:13:31.051+0000 I STORAGE  [conn578545] createCollection: edge_console.mongo_test with generated UUID: ec9bf198-4682-47b9-a42b-b8cd09245391
2021-04-06T05:13:31.110+0000 I COMMAND  [conn578545] CMD: drop edge_console.mongo_test
2021-04-06T05:13:31.110+0000 I STORAGE  [conn578545] dropCollection: edge_console.mongo_test (ec9bf198-4682-47b9-a42b-b8cd09245391) - renaming to drop-pending collection: edge_console.system.drop.1617686011i6t3.mongo_test with drop optime { ts: Timestamp(1617686011, 6), t: 3 }
2021-04-06T05:13:31.175+0000 I REPL     [replication-3350] Completing collection drop for edge_console.system.drop.1617686011i6t3.mongo_test with drop optime { ts: Timestamp(1617686011, 6), t: 3 } (notification optime: { ts: Timestamp(1617686011, 6), t: 3 })
2021-04-06T05:13:31.175+0000 I STORAGE  [replication-3350] Finishing collection drop for edge_console.system.drop.1617686011i6t3.mongo_test (ec9bf198-4682-47b9-a42b-b8cd09245391).
2021-04-06T05:13:32.486+0000 I NETWORK  [listener] connection accepted from 127.0.0.1:36478 #578546 (64 connections now open)
2021-04-06T05:13:32.488+0000 I NETWORK  [conn578546] received client metadata from 127.0.0.1:36478 conn578546: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.6.23" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 9 (stretch)"", architecture: "x86_64", version: "Kernel 4.15.0-1106-azure" } }
2021-04-06T05:13:32.492+0000 I NETWORK  [conn578546] end connection 127.0.0.1:36478 (63 connections now open)
2021-04-06T05:13:33.640+0000 I NETWORK  [listener] connection accepted from 127.0.0.1:36500 #578547 (64 connections now open)
2021-04-06T05:13:33.641+0000 I NETWORK  [conn578547] received client metadata from 127.0.0.1:36500 conn578547: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.6.23" }, os: { type: "Linux", name: "PRETTY_NAME="Debian GNU/Linux 9 (stretch)"", architecture: "x86_64", version: "Kernel 4.15.0-1106-azure" } }
2021-04-06T05:13:33.647+0000 I NETWORK  [conn578547] end connection 127.0.0.1:36500 (63 connections now open)

Elasticsearch logs:

[2021-04-06T05:19:14,303][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [auto_generate_phrase_queries] used, replaced by [This setting is ignored, use [type=phrase] instead to make phrase queries out of all text that is within query operators, or use explicitly quoted strings if you need finer-grained control]
[2021-04-06T05:19:14,303][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [split_on_whitespace] used, replaced by [This setting is ignored, the parser always splits on operator]
[2021-04-06T05:19:14,303][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:14,303][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:14,303][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:14,303][WARN ][o.e.d.s.a.I.Parser       ] [elasticsearch-master-1] Deprecated aggregation order key [_term] used, replaced by [_key]
[2021-04-06T05:19:28,212][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [use_dis_max] used, replaced by [Set [tie_breaker] to 1 instead]
[2021-04-06T05:19:28,212][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [auto_generate_phrase_queries] used, replaced by [This setting is ignored, use [type=phrase] instead to make phrase queries out of all text that is within query operators, or use explicitly quoted strings if you need finer-grained control]
[2021-04-06T05:19:28,212][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [split_on_whitespace] used, replaced by [This setting is ignored, the parser always splits on operator]
[2021-04-06T05:19:28,212][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:28,212][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:28,212][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:28,212][WARN ][o.e.d.s.a.I.Parser       ] [elasticsearch-master-1] Deprecated aggregation order key [_term] used, replaced by [_key]
[2021-04-06T05:19:28,270][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [use_dis_max] used, replaced by [Set [tie_breaker] to 1 instead]
[2021-04-06T05:19:28,270][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [auto_generate_phrase_queries] used, replaced by [This setting is ignored, use [type=phrase] instead to make phrase queries out of all text that is within query operators, or use explicitly quoted strings if you need finer-grained control]
[2021-04-06T05:19:28,270][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [split_on_whitespace] used, replaced by [This setting is ignored, the parser always splits on operator]
[2021-04-06T05:19:28,270][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:28,270][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:28,270][WARN ][o.e.d.c.ParseField       ] [elasticsearch-master-1] Deprecated field [disable_coord] used, replaced by [disable_coord has been removed]
[2021-04-06T05:19:28,270][WARN ][o.e.d.s.a.I.Parser       ] [elasticsearch-master-1] Deprecated aggregation order key [_term] used, replaced by [_key]
[2021-04-06T05:19:28,270][WARN ][o.e.d.s.a.I.Parser       ] [elasticsearch-master-1] Deprecated aggregation order key [_term] used, replaced by [_key]

@Ganeshbabu
Hello,
Thanks for the additional information.
I've looked over your logs and configurations and didn't really see what the problem could be. Without the full picture of the logs generated while the collector sidecar was in a hung state, it's hard to tell where the issue might be. Have you tried to fix the warnings and errors that appear in the log files?
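For example, the repeated `GelfCodec` WARNs mean the beat's messages are arriving without the mandatory GELF `host` field (GELF 1.1 requires `version`, `host`, and `short_message`). A crude sanity-check sketch for a payload before you ship it (the helper name and sample payloads are my own, hypothetical):

```shell
# Crude check that a GELF 1.1 payload carries the mandatory fields
# ("version", "host", "short_message"). A real client should build and
# validate proper JSON; this only greps for the field names.
has_mandatory_gelf_fields() {
  printf '%s' "$1" | grep -q '"version"' &&
  printf '%s' "$1" | grep -q '"host"' &&
  printf '%s' "$1" | grep -q '"short_message"'
}

good='{"version":"1.1","host":"azurebeat-sidecar","short_message":"test"}'
bad='{"version":"1.1","short_message":"test"}'

has_mandatory_gelf_fields "$good" && echo "ok"
has_mandatory_gelf_fields "$bad" || echo "missing mandatory field"
```

If azurebeat builds its GELF output itself, making sure it always sets `host` would at least clear those warnings out of the way.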

Just a couple of ideas:

  1. Have you tried increasing the allocated resources for the Graylog server?
  2. Are you giving enough resources to Elasticsearch?
  3. Do you have SELinux or AppArmor enabled?

As you stated and showed, the logs aren't revealing anything I could identify as the issue. So I'm assuming it might be the allocated resources for Graylog, or another service interfering with the sidecar, but maybe someone else here would know.
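Since you're on Kubernetes, one workaround sketch while you hunt the root cause: have the pod detect the hang itself. Assuming (my assumption, not verified against azurebeat) that a hung beat stops writing under its `path.data` directory from your config, a staleness check could look like this:

```shell
# Sketch of a hang detector: if no file under the beat's data directory
# (the path.data value from the config above) was modified within the
# last N minutes, assume the collector is hung. The threshold and the
# "no recent writes == hung" assumption should be validated first.
is_stale() {
  # $1 = directory, $2 = freshness window in minutes
  [ -z "$(find "$1" -type f -mmin -"$2" 2>/dev/null)" ]
}

if is_stale /var/lib/graylog-sidecar/collectors/azurebeat/data-dev 15; then
  echo "azurebeat looks hung"
fi
```

Wired into an `exec` livenessProbe on the sidecar container, the kubelet would then restart the container automatically instead of you restarting the collector from the Graylog UI.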

I'm not familiar with container installations; I've always had some type of problem with them, and I'm also unfamiliar with Helm charts.
Maybe someone else here who knows more about your environment will be able to assist you.
Sorry I can't be of more help.